Computing occupancy statistics with Python - Part 2 of 3
Occupancy by datetime
In the second part of this series, we will use Python to compute summary occupancy statistics (such as means and percentiles) by time of day, day of week, and patient category (recall that this example is from a hospital short stay unit - go back to Part 1 for all of the background info). Computation of percentiles by one or more grouping fields is a pain using tools like Excel, Access and SQL Server. With Python+pandas it’s easy.
You can find the data and the .ipynb file in my hselab-tutorials github repo. Clone or download a zip.
Preliminaries
At the end of Part 1 of this tutorial series, we ended up with a csv file called bydate_shortstay_csv.csv. Let’s read it in and take a look at it.
## Read sample data set and convert string dates to datetimes
import pandas as pd

bydate_df = pd.read_csv('data/bydate_shortstay_csv.csv', parse_dates=['datetime'])
bydate_df.head()
 | category | datetime | arrivals | binofday | binofweek | dayofweek | departures | occupancy
---|---|---|---|---|---|---|---|---
0 | IVT | 1996-01-02 00:00:00 | 0.0 | 0.0 | 48.0 | 1.0 | 0.0 | 0.0 |
1 | IVT | 1996-01-02 00:30:00 | 0.0 | 1.0 | 49.0 | 1.0 | 0.0 | 0.0 |
2 | IVT | 1996-01-02 01:00:00 | 0.0 | 2.0 | 50.0 | 1.0 | 0.0 | 0.0 |
3 | IVT | 1996-01-02 01:30:00 | 0.0 | 3.0 | 51.0 | 1.0 | 0.0 | 0.0 |
4 | IVT | 1996-01-02 02:00:00 | 0.0 | 4.0 | 52.0 | 1.0 | 0.0 | 0.0 |
bydate_df[1320:1350]
 | category | datetime | arrivals | binofday | binofweek | dayofweek | departures | occupancy
---|---|---|---|---|---|---|---|---
1320 | IVT | 1996-01-29 12:00:00 | 9.0 | 24.0 | 24.0 | 0.0 | 9.0 | 21.266667 |
1321 | IVT | 1996-01-29 12:30:00 | 9.0 | 25.0 | 25.0 | 0.0 | 6.0 | 22.333333 |
1322 | IVT | 1996-01-29 13:00:00 | 12.0 | 26.0 | 26.0 | 0.0 | 12.0 | 22.266667 |
1323 | IVT | 1996-01-29 13:30:00 | 8.0 | 27.0 | 27.0 | 0.0 | 9.0 | 23.100000 |
1324 | IVT | 1996-01-29 14:00:00 | 8.0 | 28.0 | 28.0 | 0.0 | 6.0 | 22.933333 |
1325 | IVT | 1996-01-29 14:30:00 | 9.0 | 29.0 | 29.0 | 0.0 | 11.0 | 22.300000 |
1326 | IVT | 1996-01-29 15:00:00 | 7.0 | 30.0 | 30.0 | 0.0 | 6.0 | 23.900000 |
1327 | IVT | 1996-01-29 15:30:00 | 6.0 | 31.0 | 31.0 | 0.0 | 9.0 | 22.600000 |
1328 | IVT | 1996-01-29 16:00:00 | 9.0 | 32.0 | 32.0 | 0.0 | 11.0 | 19.700000 |
1329 | IVT | 1996-01-29 16:30:00 | 7.0 | 33.0 | 33.0 | 0.0 | 4.0 | 19.533333 |
1330 | IVT | 1996-01-29 17:00:00 | 5.0 | 34.0 | 34.0 | 0.0 | 9.0 | 18.033333 |
1331 | IVT | 1996-01-29 17:30:00 | 2.0 | 35.0 | 35.0 | 0.0 | 9.0 | 13.600000 |
1332 | IVT | 1996-01-29 18:00:00 | 3.0 | 36.0 | 36.0 | 0.0 | 5.0 | 9.566667 |
1333 | IVT | 1996-01-29 18:30:00 | 2.0 | 37.0 | 37.0 | 0.0 | 4.0 | 7.166667 |
1334 | IVT | 1996-01-29 19:00:00 | 1.0 | 38.0 | 38.0 | 0.0 | 3.0 | 5.833333 |
1335 | IVT | 1996-01-29 19:30:00 | 3.0 | 39.0 | 39.0 | 0.0 | 2.0 | 6.366667 |
1336 | IVT | 1996-01-29 20:00:00 | 0.0 | 40.0 | 40.0 | 0.0 | 3.0 | 4.833333 |
1337 | IVT | 1996-01-29 20:30:00 | 0.0 | 41.0 | 41.0 | 0.0 | 3.0 | 1.000000 |
1338 | IVT | 1996-01-29 21:00:00 | 0.0 | 42.0 | 42.0 | 0.0 | 0.0 | 0.000000 |
1339 | IVT | 1996-01-29 21:30:00 | 0.0 | 43.0 | 43.0 | 0.0 | 0.0 | 0.000000 |
1340 | IVT | 1996-01-29 22:00:00 | 0.0 | 44.0 | 44.0 | 0.0 | 0.0 | 0.000000 |
1341 | IVT | 1996-01-29 22:30:00 | 0.0 | 45.0 | 45.0 | 0.0 | 0.0 | 0.000000 |
1342 | IVT | 1996-01-29 23:00:00 | 0.0 | 46.0 | 46.0 | 0.0 | 0.0 | 0.000000 |
1343 | IVT | 1996-01-29 23:30:00 | 0.0 | 47.0 | 47.0 | 0.0 | 0.0 | 0.000000 |
1344 | IVT | 1996-01-30 00:00:00 | 0.0 | 0.0 | 48.0 | 1.0 | 0.0 | 0.000000 |
1345 | IVT | 1996-01-30 00:30:00 | 0.0 | 1.0 | 49.0 | 1.0 | 0.0 | 0.000000 |
1346 | IVT | 1996-01-30 01:00:00 | 0.0 | 2.0 | 50.0 | 1.0 | 0.0 | 0.000000 |
1347 | IVT | 1996-01-30 01:30:00 | 0.0 | 3.0 | 51.0 | 1.0 | 0.0 | 0.000000 |
1348 | IVT | 1996-01-30 02:00:00 | 0.0 | 4.0 | 52.0 | 1.0 | 0.0 | 0.000000 |
1349 | IVT | 1996-01-30 02:30:00 | 0.0 | 5.0 | 53.0 | 1.0 | 0.0 | 0.000000 |
With this data frame we can compute all kinds of interesting summary statistics by category, by day of week and time of day. To facilitate this type of “group by” analysis, pandas takes what is known as the Split-Apply-Combine approach. The pandas documentation has a nice discussion of this. To really understand split-apply-combine, check out the article by Hadley Wickham, who created the plyr package for R. I also created a tutorial on Getting started with Python (with pandas and matplotlib) for group by analysis that covers some of the basics. A companion tutorial shows how to do the same analysis using R instead of Python.
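To make the three steps concrete, here's a minimal sketch on a tiny hypothetical DataFrame (the data is made up purely for illustration):

import pandas as pd

# A tiny made-up DataFrame, just to illustrate the three steps
toy = pd.DataFrame({'category': ['A', 'A', 'B', 'B'],
                    'occupancy': [2.0, 4.0, 1.0, 6.0]})

# Split: groupby() partitions the rows by category
# Apply: mean() is computed on each piece
# Combine: the per-group results come back as a single Series
print(toy.groupby('category')['occupancy'].mean())
# category
# A    3.0
# B    3.5
# Name: occupancy, dtype: float64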
Pandas provides a GroupBy object to facilitate computing aggregate statistics by grouping fields.
# Create a GroupBy object for the summary stats
bydate_dfgrp1 = bydate_df.groupby(['category','binofweek'])
# Having a group by object makes it easy to compute statistics such as the mean of all of the fields other than the grouping fields.
# You'll see that the result is simply another DataFrame.
bydate_dfgrp1.mean()
category | binofweek | arrivals | binofday | dayofweek | departures | occupancy
---|---|---|---|---|---|---
ART | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
1.0 | 0.000000 | 1.0 | 0.0 | 0.000000 | 0.000000 | |
2.0 | 0.000000 | 2.0 | 0.0 | 0.000000 | 0.000000 | |
3.0 | 0.000000 | 3.0 | 0.0 | 0.000000 | 0.000000 | |
4.0 | 0.000000 | 4.0 | 0.0 | 0.000000 | 0.000000 | |
... | ... | ... | ... | ... | ... | ... |
Total | 331.0 | 0.717949 | 43.0 | 6.0 | 0.538462 | 0.837607 |
332.0 | 0.179487 | 44.0 | 6.0 | 0.538462 | 0.652137 | |
333.0 | 0.717949 | 45.0 | 6.0 | 0.358974 | 0.682051 | |
334.0 | 0.179487 | 46.0 | 6.0 | 0.179487 | 0.831624 | |
335.0 | 0.179487 | 47.0 | 6.0 | 0.717949 | 0.412821 |
2016 rows × 5 columns
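Since the GroupBy object keeps the split pieces around, you can also pull a single group out for inspection with get_group(). A quick sketch (the ('IVT', 25.0) key is just an illustrative choice):

# Look at the sub-DataFrame for one (category, binofweek) combination
bydate_dfgrp1.get_group(('IVT', 25.0)).head()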
# Let's explore some of the means.
bydate_dfgrp1.mean()[100:120]
category | binofweek | arrivals | binofday | dayofweek | departures | occupancy
---|---|---|---|---|---|---
ART | 100.0 | 0.000000 | 4.0 | 2.0 | 0.000000 | 0.000000 |
101.0 | 0.000000 | 5.0 | 2.0 | 0.000000 | 0.000000 | |
102.0 | 0.000000 | 6.0 | 2.0 | 0.000000 | 0.000000 | |
103.0 | 0.000000 | 7.0 | 2.0 | 0.000000 | 0.000000 | |
104.0 | 0.000000 | 8.0 | 2.0 | 0.000000 | 0.000000 | |
105.0 | 0.000000 | 9.0 | 2.0 | 0.000000 | 0.000000 | |
106.0 | 0.000000 | 10.0 | 2.0 | 0.000000 | 0.000000 | |
107.0 | 1.538462 | 11.0 | 2.0 | 0.000000 | 0.782051 | |
108.0 | 1.769231 | 12.0 | 2.0 | 0.000000 | 2.361538 | |
109.0 | 3.384615 | 13.0 | 2.0 | 0.000000 | 5.058974 | |
110.0 | 1.769231 | 14.0 | 2.0 | 0.000000 | 7.674359 | |
111.0 | 1.538462 | 15.0 | 2.0 | 3.076923 | 7.584615 | |
112.0 | 1.692308 | 16.0 | 2.0 | 3.384615 | 5.225641 | |
113.0 | 1.692308 | 17.0 | 2.0 | 1.538462 | 5.300000 | |
114.0 | 2.153846 | 18.0 | 2.0 | 1.923077 | 5.282051 | |
115.0 | 1.846154 | 19.0 | 2.0 | 1.923077 | 5.412821 | |
116.0 | 1.153846 | 20.0 | 2.0 | 1.923077 | 5.228205 | |
117.0 | 1.461538 | 21.0 | 2.0 | 1.230769 | 4.800000 | |
118.0 | 1.692308 | 22.0 | 2.0 | 2.000000 | 4.764103 | |
119.0 | 2.153846 | 23.0 | 2.0 | 1.461538 | 5.064103 |
Now that we’ve seen how a GroupBy object works, let’s see how we can compute a whole bunch of summary statistics at once. Specifically, we want to compute the mean, standard deviation, min, max and several percentiles. First, let’s create a slightly different GroupBy object.
bydate_dfgrp2 = bydate_df.groupby(['category','dayofweek','binofday'])
Now let’s define a function that will return a bunch of statistics in a dictionary for a column of data.
def get_occstats(group, stub=''):
    return {stub+'count': group.count(), stub+'mean': group.mean(),
            stub+'min': group.min(), stub+'max': group.max(),
            stub+'stdev': group.std(),
            stub+'p50': group.quantile(0.5), stub+'p55': group.quantile(0.55),
            stub+'p60': group.quantile(0.6), stub+'p65': group.quantile(0.65),
            stub+'p70': group.quantile(0.7), stub+'p75': group.quantile(0.75),
            stub+'p80': group.quantile(0.8), stub+'p85': group.quantile(0.85),
            stub+'p90': group.quantile(0.9), stub+'p95': group.quantile(0.95),
            stub+'p975': group.quantile(0.975),
            stub+'p99': group.quantile(0.99)}
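Nothing about get_occstats() requires a GroupBy; you can sanity check it on any plain pandas Series first. For example (just a spot check on the overall occupancy column):

# Spot check: summary stats for the entire occupancy column, no grouping
get_occstats(bydate_df['occupancy'], stub='occ_')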
Now we can use the apply function to apply the get_occstats() function to a data series. We’ll create separate output data series for occupancy, arrivals and departures.
occ_stats = bydate_dfgrp2['occupancy'].apply(get_occstats)
arr_stats = bydate_dfgrp2['arrivals'].apply(get_occstats)
dep_stats = bydate_dfgrp2['departures'].apply(get_occstats)
So, what is occ_stats?
type(occ_stats)
pandas.core.series.Series
It’s a pandas Series object. What does its index look like?
occ_stats.index
MultiIndex([( 'ART', 0.0, 0.0, 'count'),
( 'ART', 0.0, 0.0, 'mean'),
( 'ART', 0.0, 0.0, 'min'),
( 'ART', 0.0, 0.0, 'max'),
( 'ART', 0.0, 0.0, 'stdev'),
( 'ART', 0.0, 0.0, 'p50'),
( 'ART', 0.0, 0.0, 'p55'),
( 'ART', 0.0, 0.0, 'p60'),
( 'ART', 0.0, 0.0, 'p65'),
( 'ART', 0.0, 0.0, 'p70'),
...
('Total', 6.0, 47.0, 'p60'),
('Total', 6.0, 47.0, 'p65'),
('Total', 6.0, 47.0, 'p70'),
('Total', 6.0, 47.0, 'p75'),
('Total', 6.0, 47.0, 'p80'),
('Total', 6.0, 47.0, 'p85'),
('Total', 6.0, 47.0, 'p90'),
('Total', 6.0, 47.0, 'p95'),
('Total', 6.0, 47.0, 'p975'),
('Total', 6.0, 47.0, 'p99')],
names=['category', 'dayofweek', 'binofday', None], length=34272)
Notice it’s a MultiIndex with 4 levels: category, dayofweek, binofday, statistic. It would be nice to “un-pivot” the statistic from the index and have it correspond to a set of columns. That’s what unstack() will do. It will leave us with a DataFrame with all of the statistics as columns and a 3-level MultiIndex of category, dayofweek and binofday. Perfect for plotting.
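If unstack() is new to you, here’s a minimal sketch of what it does on a tiny hypothetical MultiIndex Series before we apply it to occ_stats:

# Toy 2-level MultiIndex Series; unstack() pivots the innermost
# index level ('stat') out of the index and into columns
idx = pd.MultiIndex.from_tuples(
    [('A', 'mean'), ('A', 'stdev'), ('B', 'mean'), ('B', 'stdev')],
    names=['category', 'stat'])
toy = pd.Series([5.0, 1.2, 7.0, 2.3], index=idx)
print(toy.unstack())
# stat      mean  stdev
# category
# A          5.0    1.2
# B          7.0    2.3

Now the real thing: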
occ_stats.unstack()
category | dayofweek | binofday | count | mean | min | max | stdev | p50 | p55 | p60 | p65 | p70 | p75 | p80 | p85 | p90 | p95 | p975 | p99
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ART | 0.0 | 0.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
2.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
3.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
4.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Total | 6.0 | 43.0 | 39.0 | 0.837607 | 0.0 | 4.0 | 1.185640 | 0.0 | 0.45 | 0.66 | 0.960000 | 1.00 | 1.300000 | 1.840000 | 2.000000 | 2.560000 | 3.520000 | 4.000000 | 4.000000 |
44.0 | 39.0 | 0.652137 | 0.0 | 8.0 | 1.584774 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.813333 | 1.620000 | 2.240000 | 3.400000 | 4.200000 | 6.480000 | ||
45.0 | 39.0 | 0.682051 | 0.0 | 4.0 | 1.020631 | 0.0 | 0.27 | 0.56 | 0.646667 | 0.82 | 1.100000 | 1.253333 | 1.620000 | 2.080000 | 2.733333 | 3.366667 | 3.746667 | ||
46.0 | 39.0 | 0.831624 | 0.0 | 4.0 | 1.279230 | 0.0 | 0.15 | 0.44 | 0.853333 | 1.00 | 1.000000 | 1.920000 | 2.000000 | 2.400000 | 4.000000 | 4.000000 | 4.000000 | ||
47.0 | 39.0 | 0.412821 | 0.0 | 4.0 | 0.838915 | 0.0 | 0.00 | 0.00 | 0.166667 | 0.32 | 0.466667 | 0.666667 | 0.766667 | 1.226667 | 2.066667 | 2.733333 | 3.493333 |
2016 rows × 17 columns
occ_stats_summary = occ_stats.unstack()
arr_stats_summary = arr_stats.unstack()
dep_stats_summary = dep_stats.unstack()
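As an aside, you can get a similar table more directly with GroupBy.agg(), passing a list of function names and custom percentile helpers. This is just a sketch of the idea, not a drop-in replacement for the approach above (the p95 helper is a name I'm defining here):

# Alternative: agg() builds the stat columns directly, no unstack() needed
def p95(group):
    return group.quantile(0.95)

occ_agg = bydate_dfgrp2['occupancy'].agg(['count', 'mean', 'min', 'max', 'std', p95])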
# Let's peek into the middle of the table.
occ_stats_summary[200:220]
category | dayofweek | binofday | count | mean | min | max | stdev | p50 | p55 | p60 | p65 | p70 | p75 | p80 | p85 | p90 | p95 | p975 | p99
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ART | 4.0 | 8.0 | 13.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
9.0 | 13.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
10.0 | 13.0 | 0.010256 | 0.000000 | 0.133333 | 0.036980 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.053333 | 0.093333 | 0.117333 | ||
11.0 | 13.0 | 0.325641 | 0.000000 | 1.600000 | 0.485179 | 0.000000 | 0.060000 | 0.180000 | 0.420000 | 0.553333 | 0.633333 | 0.653333 | 0.680000 | 0.720000 | 1.080000 | 1.340000 | 1.496000 | ||
12.0 | 13.0 | 1.800000 | 0.000000 | 4.500000 | 1.290923 | 1.966667 | 2.026667 | 2.146667 | 2.386667 | 2.493333 | 2.533333 | 2.713333 | 2.886667 | 3.046667 | 3.660000 | 4.080000 | 4.332000 | ||
13.0 | 13.0 | 5.143590 | 2.266667 | 7.966667 | 1.619007 | 4.933333 | 5.073333 | 5.226667 | 5.406667 | 5.840000 | 6.400000 | 6.480000 | 6.740000 | 7.360000 | 7.726667 | 7.846667 | 7.918667 | ||
14.0 | 13.0 | 8.497436 | 6.266667 | 11.166667 | 1.368661 | 7.966667 | 8.366667 | 8.733333 | 9.033333 | 9.226667 | 9.366667 | 9.466667 | 9.700000 | 10.200000 | 10.686667 | 10.926667 | 11.070667 | ||
15.0 | 13.0 | 8.707692 | 5.166667 | 10.733333 | 1.572987 | 9.200000 | 9.480000 | 9.666667 | 9.666667 | 9.680000 | 9.700000 | 9.860000 | 10.040000 | 10.260000 | 10.493333 | 10.613333 | 10.685333 | ||
16.0 | 13.0 | 6.046154 | 2.833333 | 8.666667 | 1.833815 | 6.533333 | 6.613333 | 6.680000 | 6.720000 | 6.880000 | 7.100000 | 7.480000 | 7.866667 | 8.266667 | 8.506667 | 8.586667 | 8.634667 | ||
17.0 | 13.0 | 6.279487 | 3.500000 | 9.066667 | 1.587074 | 6.333333 | 6.353333 | 6.480000 | 6.820000 | 6.960000 | 7.000000 | 7.060000 | 7.473333 | 8.593333 | 9.006667 | 9.036667 | 9.054667 | ||
18.0 | 13.0 | 6.310256 | 4.000000 | 8.800000 | 1.491729 | 6.500000 | 6.760000 | 6.966667 | 7.066667 | 7.220000 | 7.400000 | 7.620000 | 7.800000 | 7.900000 | 8.280000 | 8.540000 | 8.696000 | ||
19.0 | 13.0 | 6.074359 | 3.833333 | 7.566667 | 1.215955 | 5.966667 | 6.466667 | 6.840000 | 6.960000 | 7.053333 | 7.133333 | 7.193333 | 7.280000 | 7.420000 | 7.506667 | 7.536667 | 7.554667 | ||
20.0 | 13.0 | 5.784615 | 2.533333 | 9.500000 | 1.736142 | 5.533333 | 5.873333 | 6.153333 | 6.313333 | 6.473333 | 6.633333 | 6.753333 | 7.013333 | 7.553333 | 8.440000 | 8.970000 | 9.288000 | ||
21.0 | 13.0 | 6.271795 | 3.833333 | 9.900000 | 1.718303 | 5.700000 | 5.780000 | 5.860000 | 5.940000 | 6.566667 | 7.466667 | 7.646667 | 7.946667 | 8.486667 | 9.160000 | 9.530000 | 9.752000 | ||
22.0 | 13.0 | 6.343590 | 2.633333 | 9.000000 | 1.634092 | 6.633333 | 6.673333 | 6.793333 | 7.073333 | 7.246667 | 7.366667 | 7.526667 | 7.646667 | 7.686667 | 8.220000 | 8.610000 | 8.844000 | ||
23.0 | 13.0 | 5.692308 | 3.533333 | 9.100000 | 1.697337 | 5.100000 | 5.440000 | 5.806667 | 6.226667 | 6.660000 | 7.100000 | 7.300000 | 7.446667 | 7.486667 | 8.140000 | 8.620000 | 8.908000 | ||
24.0 | 13.0 | 5.592308 | 2.800000 | 8.900000 | 1.593921 | 5.433333 | 5.453333 | 5.566667 | 5.866667 | 6.113333 | 6.333333 | 6.633333 | 6.993333 | 7.473333 | 8.140000 | 8.520000 | 8.748000 | ||
25.0 | 13.0 | 5.282051 | 3.000000 | 8.566667 | 1.492674 | 5.166667 | 5.226667 | 5.340000 | 5.560000 | 5.753333 | 5.933333 | 6.173333 | 6.486667 | 6.946667 | 7.686667 | 8.126667 | 8.390667 | ||
26.0 | 13.0 | 4.779487 | 2.300000 | 8.333333 | 1.769627 | 4.800000 | 4.920000 | 5.106667 | 5.426667 | 5.533333 | 5.533333 | 6.013333 | 6.400000 | 6.600000 | 7.333333 | 7.833333 | 8.133333 | ||
27.0 | 13.0 | 4.817949 | 2.266667 | 8.066667 | 1.493295 | 4.833333 | 4.933333 | 5.073333 | 5.293333 | 5.393333 | 5.433333 | 5.653333 | 5.906667 | 6.226667 | 7.026667 | 7.546667 | 7.858667 |
Wouldn’t it be nice if Excel Pivot Tables could produce the output above? Why can’t they? Because they can’t do things like percentiles (or other custom aggregate functions). I love spreadsheets. I teach spreadsheet modeling. However, I find myself using either Python+pandas+matplotlib or R+plyr+ggplot2 more and more frequently for things I used to do in Excel.
Let’s fire these guys out to csv files so we can check them out and maybe play with them in a spreadsheet.
occ_stats_summary.to_csv('occ_stats_summary.csv')
arr_stats_summary.to_csv('arr_stats_summary.csv')
dep_stats_summary.to_csv('dep_stats_summary.csv')
The real reason I exported them to csv was to make it easy to read these results back in for Part 3 of this series of tutorials. In Part 3, we’ll create some plots using matplotlib based on these summary statistics.
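If you want to read one of these files back in with the grouping fields restored as the index (as we’ll do in Part 3), something like the following should work, assuming the default layout written by to_csv() above:

# Rebuild the (category, dayofweek, binofday) MultiIndex on read
occ_stats_summary = pd.read_csv('occ_stats_summary.csv',
                                index_col=['category', 'dayofweek', 'binofday'])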