Computing occupancy statistics with Python - Part 2 of 3
Occupancy by datetime
In the second part of this series, we will use Python to compute summary occupancy statistics (such as means and percentiles) by time of day, day of week, and patient category (recall that this example is from a hospital short stay unit - go back to Part 1 for all of the background info). Computation of percentiles by one or more grouping fields is a pain using tools like Excel, Access and SQL Server. With Python+pandas it’s easy.
You can find the data and the .ipynb file in my hselab-tutorials github repo. Clone or download a zip.
Preliminaries
At the end of Part 1 of this tutorial series, we ended up with a csv file called bydate_shortstay_csv.csv. Let’s read it in and take a look at it.
## Read sample data set and convert string dates to datetimes
import pandas as pd

bydate_df = pd.read_csv('data/bydate_shortstay_csv.csv', parse_dates=['datetime'])
bydate_df.head()
 | category | datetime | arrivals | binofday | binofweek | dayofweek | departures | occupancy
---|---|---|---|---|---|---|---|---
0 | IVT | 1996-01-02 00:00:00 | 0.0 | 0.0 | 48.0 | 1.0 | 0.0 | 0.0 |
1 | IVT | 1996-01-02 00:30:00 | 0.0 | 1.0 | 49.0 | 1.0 | 0.0 | 0.0 |
2 | IVT | 1996-01-02 01:00:00 | 0.0 | 2.0 | 50.0 | 1.0 | 0.0 | 0.0 |
3 | IVT | 1996-01-02 01:30:00 | 0.0 | 3.0 | 51.0 | 1.0 | 0.0 | 0.0 |
4 | IVT | 1996-01-02 02:00:00 | 0.0 | 4.0 | 52.0 | 1.0 | 0.0 | 0.0 |
bydate_df[1320:1350]
 | category | datetime | arrivals | binofday | binofweek | dayofweek | departures | occupancy
---|---|---|---|---|---|---|---|---
1320 | IVT | 1996-01-29 12:00:00 | 9.0 | 24.0 | 24.0 | 0.0 | 9.0 | 21.266667 |
1321 | IVT | 1996-01-29 12:30:00 | 9.0 | 25.0 | 25.0 | 0.0 | 6.0 | 22.333333 |
1322 | IVT | 1996-01-29 13:00:00 | 12.0 | 26.0 | 26.0 | 0.0 | 12.0 | 22.266667 |
1323 | IVT | 1996-01-29 13:30:00 | 8.0 | 27.0 | 27.0 | 0.0 | 9.0 | 23.100000 |
1324 | IVT | 1996-01-29 14:00:00 | 8.0 | 28.0 | 28.0 | 0.0 | 6.0 | 22.933333 |
1325 | IVT | 1996-01-29 14:30:00 | 9.0 | 29.0 | 29.0 | 0.0 | 11.0 | 22.300000 |
1326 | IVT | 1996-01-29 15:00:00 | 7.0 | 30.0 | 30.0 | 0.0 | 6.0 | 23.900000 |
1327 | IVT | 1996-01-29 15:30:00 | 6.0 | 31.0 | 31.0 | 0.0 | 9.0 | 22.600000 |
1328 | IVT | 1996-01-29 16:00:00 | 9.0 | 32.0 | 32.0 | 0.0 | 11.0 | 19.700000 |
1329 | IVT | 1996-01-29 16:30:00 | 7.0 | 33.0 | 33.0 | 0.0 | 4.0 | 19.533333 |
1330 | IVT | 1996-01-29 17:00:00 | 5.0 | 34.0 | 34.0 | 0.0 | 9.0 | 18.033333 |
1331 | IVT | 1996-01-29 17:30:00 | 2.0 | 35.0 | 35.0 | 0.0 | 9.0 | 13.600000 |
1332 | IVT | 1996-01-29 18:00:00 | 3.0 | 36.0 | 36.0 | 0.0 | 5.0 | 9.566667 |
1333 | IVT | 1996-01-29 18:30:00 | 2.0 | 37.0 | 37.0 | 0.0 | 4.0 | 7.166667 |
1334 | IVT | 1996-01-29 19:00:00 | 1.0 | 38.0 | 38.0 | 0.0 | 3.0 | 5.833333 |
1335 | IVT | 1996-01-29 19:30:00 | 3.0 | 39.0 | 39.0 | 0.0 | 2.0 | 6.366667 |
1336 | IVT | 1996-01-29 20:00:00 | 0.0 | 40.0 | 40.0 | 0.0 | 3.0 | 4.833333 |
1337 | IVT | 1996-01-29 20:30:00 | 0.0 | 41.0 | 41.0 | 0.0 | 3.0 | 1.000000 |
1338 | IVT | 1996-01-29 21:00:00 | 0.0 | 42.0 | 42.0 | 0.0 | 0.0 | 0.000000 |
1339 | IVT | 1996-01-29 21:30:00 | 0.0 | 43.0 | 43.0 | 0.0 | 0.0 | 0.000000 |
1340 | IVT | 1996-01-29 22:00:00 | 0.0 | 44.0 | 44.0 | 0.0 | 0.0 | 0.000000 |
1341 | IVT | 1996-01-29 22:30:00 | 0.0 | 45.0 | 45.0 | 0.0 | 0.0 | 0.000000 |
1342 | IVT | 1996-01-29 23:00:00 | 0.0 | 46.0 | 46.0 | 0.0 | 0.0 | 0.000000 |
1343 | IVT | 1996-01-29 23:30:00 | 0.0 | 47.0 | 47.0 | 0.0 | 0.0 | 0.000000 |
1344 | IVT | 1996-01-30 00:00:00 | 0.0 | 0.0 | 48.0 | 1.0 | 0.0 | 0.000000 |
1345 | IVT | 1996-01-30 00:30:00 | 0.0 | 1.0 | 49.0 | 1.0 | 0.0 | 0.000000 |
1346 | IVT | 1996-01-30 01:00:00 | 0.0 | 2.0 | 50.0 | 1.0 | 0.0 | 0.000000 |
1347 | IVT | 1996-01-30 01:30:00 | 0.0 | 3.0 | 51.0 | 1.0 | 0.0 | 0.000000 |
1348 | IVT | 1996-01-30 02:00:00 | 0.0 | 4.0 | 52.0 | 1.0 | 0.0 | 0.000000 |
1349 | IVT | 1996-01-30 02:30:00 | 0.0 | 5.0 | 53.0 | 1.0 | 0.0 | 0.000000 |
With this data frame we can compute all kinds of interesting summary statistics by category, by day of week and time of day. To facilitate this type of “group by” analysis, pandas takes what is known as the Split-Apply-Combine approach. The pandas documentation has a nice discussion of this. To really understand split-apply-combine, check out the article by Hadley Wickham, who created the plyr package for R. I also created a tutorial on Getting started with Python (with pandas and matplotlib) for group by analysis that covers some of the basics. A companion tutorial shows how to do the same analysis using R instead of Python.
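To make the three steps concrete, here's a minimal sketch on a tiny hypothetical DataFrame (the data is made up purely for illustration):

import pandas as pd

# A tiny made-up DataFrame, just to illustrate the three steps
toy = pd.DataFrame({'category': ['A', 'A', 'B', 'B'],
                    'occupancy': [2.0, 4.0, 1.0, 6.0]})

# Split: groupby() partitions the rows by category
# Apply: mean() is computed on each piece
# Combine: the per-group results come back as a single Series
print(toy.groupby('category')['occupancy'].mean())
# category
# A    3.0
# B    3.5
# Name: occupancy, dtype: float64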
Pandas provides a GroupBy object to facilitate computing aggregate statistics by grouping fields.
# Create a GroupBy object for the summary stats
bydate_dfgrp1 = bydate_df.groupby(['category','binofweek'])
# Having a group by object makes it easy to compute statistics such as the mean of all of the fields other than the grouping fields.
# You'll see that the result is simply another DataFrame.
bydate_dfgrp1.mean()
category | binofweek | arrivals | binofday | dayofweek | departures | occupancy
---|---|---|---|---|---|---
ART | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
1.0 | 0.000000 | 1.0 | 0.0 | 0.000000 | 0.000000 | |
2.0 | 0.000000 | 2.0 | 0.0 | 0.000000 | 0.000000 | |
3.0 | 0.000000 | 3.0 | 0.0 | 0.000000 | 0.000000 | |
4.0 | 0.000000 | 4.0 | 0.0 | 0.000000 | 0.000000 | |
... | ... | ... | ... | ... | ... | ... |
Total | 331.0 | 0.717949 | 43.0 | 6.0 | 0.538462 | 0.837607 |
332.0 | 0.179487 | 44.0 | 6.0 | 0.538462 | 0.652137 | |
333.0 | 0.717949 | 45.0 | 6.0 | 0.358974 | 0.682051 | |
334.0 | 0.179487 | 46.0 | 6.0 | 0.179487 | 0.831624 | |
335.0 | 0.179487 | 47.0 | 6.0 | 0.717949 | 0.412821 |
2016 rows × 5 columns
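Since the GroupBy object keeps the split pieces around, you can also pull a single group out for inspection with get_group(). A quick sketch (the ('IVT', 25.0) key is just an illustrative choice):

# Look at the sub-DataFrame for one (category, binofweek) combination
bydate_dfgrp1.get_group(('IVT', 25.0)).head()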
# Let's explore some of the means.
bydate_dfgrp1.mean()[100:120]
category | binofweek | arrivals | binofday | dayofweek | departures | occupancy
---|---|---|---|---|---|---
ART | 100.0 | 0.000000 | 4.0 | 2.0 | 0.000000 | 0.000000 |
101.0 | 0.000000 | 5.0 | 2.0 | 0.000000 | 0.000000 | |
102.0 | 0.000000 | 6.0 | 2.0 | 0.000000 | 0.000000 | |
103.0 | 0.000000 | 7.0 | 2.0 | 0.000000 | 0.000000 | |
104.0 | 0.000000 | 8.0 | 2.0 | 0.000000 | 0.000000 | |
105.0 | 0.000000 | 9.0 | 2.0 | 0.000000 | 0.000000 | |
106.0 | 0.000000 | 10.0 | 2.0 | 0.000000 | 0.000000 | |
107.0 | 1.538462 | 11.0 | 2.0 | 0.000000 | 0.782051 | |
108.0 | 1.769231 | 12.0 | 2.0 | 0.000000 | 2.361538 | |
109.0 | 3.384615 | 13.0 | 2.0 | 0.000000 | 5.058974 | |
110.0 | 1.769231 | 14.0 | 2.0 | 0.000000 | 7.674359 | |
111.0 | 1.538462 | 15.0 | 2.0 | 3.076923 | 7.584615 | |
112.0 | 1.692308 | 16.0 | 2.0 | 3.384615 | 5.225641 | |
113.0 | 1.692308 | 17.0 | 2.0 | 1.538462 | 5.300000 | |
114.0 | 2.153846 | 18.0 | 2.0 | 1.923077 | 5.282051 | |
115.0 | 1.846154 | 19.0 | 2.0 | 1.923077 | 5.412821 | |
116.0 | 1.153846 | 20.0 | 2.0 | 1.923077 | 5.228205 | |
117.0 | 1.461538 | 21.0 | 2.0 | 1.230769 | 4.800000 | |
118.0 | 1.692308 | 22.0 | 2.0 | 2.000000 | 4.764103 | |
119.0 | 2.153846 | 23.0 | 2.0 | 1.461538 | 5.064103 |
Now that we’ve seen how a GroupBy object works, let’s see how we can compute a whole bunch of summary statistics at once. Specifically, we want to compute the mean, standard deviation, min, max and several percentiles. First, let’s create a slightly different GroupBy object.
bydate_dfgrp2 = bydate_df.groupby(['category','dayofweek','binofday'])
Now let’s define a function that will return a bunch of statistics in a dictionary for a column of data.
def get_occstats(group, stub=''):
    return {stub+'count': group.count(), stub+'mean': group.mean(),
            stub+'min': group.min(), stub+'max': group.max(),
            stub+'stdev': group.std(),
            stub+'p50': group.quantile(0.5), stub+'p55': group.quantile(0.55),
            stub+'p60': group.quantile(0.6), stub+'p65': group.quantile(0.65),
            stub+'p70': group.quantile(0.7), stub+'p75': group.quantile(0.75),
            stub+'p80': group.quantile(0.8), stub+'p85': group.quantile(0.85),
            stub+'p90': group.quantile(0.9), stub+'p95': group.quantile(0.95),
            stub+'p975': group.quantile(0.975),
            stub+'p99': group.quantile(0.99)}
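Nothing about get_occstats() requires a GroupBy; you can sanity check it on any plain pandas Series first. For example (just a spot check on the overall occupancy column):

# Spot check: summary stats for the entire occupancy column, no grouping
get_occstats(bydate_df['occupancy'], stub='occ_')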
Now we can use the apply function to apply the get_occstats() function to a data series. We’ll create separate output data series for occupancy, arrivals and departures.
occ_stats = bydate_dfgrp2['occupancy'].apply(get_occstats)
arr_stats = bydate_dfgrp2['arrivals'].apply(get_occstats)
dep_stats = bydate_dfgrp2['departures'].apply(get_occstats)
So, what is occ_stats?
type(occ_stats)
pandas.core.series.Series
It’s a pandas Series object. What does its index look like?
occ_stats.index
MultiIndex([( 'ART', 0.0, 0.0, 'count'),
( 'ART', 0.0, 0.0, 'mean'),
( 'ART', 0.0, 0.0, 'min'),
( 'ART', 0.0, 0.0, 'max'),
( 'ART', 0.0, 0.0, 'stdev'),
( 'ART', 0.0, 0.0, 'p50'),
( 'ART', 0.0, 0.0, 'p55'),
( 'ART', 0.0, 0.0, 'p60'),
( 'ART', 0.0, 0.0, 'p65'),
( 'ART', 0.0, 0.0, 'p70'),
...
('Total', 6.0, 47.0, 'p60'),
('Total', 6.0, 47.0, 'p65'),
('Total', 6.0, 47.0, 'p70'),
('Total', 6.0, 47.0, 'p75'),
('Total', 6.0, 47.0, 'p80'),
('Total', 6.0, 47.0, 'p85'),
('Total', 6.0, 47.0, 'p90'),
('Total', 6.0, 47.0, 'p95'),
('Total', 6.0, 47.0, 'p975'),
('Total', 6.0, 47.0, 'p99')],
names=['category', 'dayofweek', 'binofday', None], length=34272)
Notice it’s a MultiIndex with 4 levels: category, dayofweek, binofday, statistic. It would be nice to “un-pivot” the statistic from the index and have it correspond to a set of columns. That’s what unstack() will do. It will leave us with a DataFrame with all of the statistics as columns and a 3-level MultiIndex of category, dayofweek and binofday. Perfect for plotting.
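If unstack() is new to you, here’s a minimal sketch of what it does on a tiny hypothetical MultiIndex Series before we apply it to occ_stats:

# Toy 2-level MultiIndex Series; unstack() pivots the innermost
# index level ('stat') out of the index and into columns
idx = pd.MultiIndex.from_tuples(
    [('A', 'mean'), ('A', 'stdev'), ('B', 'mean'), ('B', 'stdev')],
    names=['category', 'stat'])
toy = pd.Series([5.0, 1.2, 7.0, 2.3], index=idx)
print(toy.unstack())
# stat      mean  stdev
# category
# A          5.0    1.2
# B          7.0    2.3

Now the real thing: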
occ_stats.unstack()
category | dayofweek | binofday | count | mean | min | max | stdev | p50 | p55 | p60 | p65 | p70 | p75 | p80 | p85 | p90 | p95 | p975 | p99
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ART | 0.0 | 0.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
2.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
3.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
4.0 | 12.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Total | 6.0 | 43.0 | 39.0 | 0.837607 | 0.0 | 4.0 | 1.185640 | 0.0 | 0.45 | 0.66 | 0.960000 | 1.00 | 1.300000 | 1.840000 | 2.000000 | 2.560000 | 3.520000 | 4.000000 | 4.000000 |
44.0 | 39.0 | 0.652137 | 0.0 | 8.0 | 1.584774 | 0.0 | 0.00 | 0.00 | 0.000000 | 0.00 | 0.000000 | 0.813333 | 1.620000 | 2.240000 | 3.400000 | 4.200000 | 6.480000 | ||
45.0 | 39.0 | 0.682051 | 0.0 | 4.0 | 1.020631 | 0.0 | 0.27 | 0.56 | 0.646667 | 0.82 | 1.100000 | 1.253333 | 1.620000 | 2.080000 | 2.733333 | 3.366667 | 3.746667 | ||
46.0 | 39.0 | 0.831624 | 0.0 | 4.0 | 1.279230 | 0.0 | 0.15 | 0.44 | 0.853333 | 1.00 | 1.000000 | 1.920000 | 2.000000 | 2.400000 | 4.000000 | 4.000000 | 4.000000 | ||
47.0 | 39.0 | 0.412821 | 0.0 | 4.0 | 0.838915 | 0.0 | 0.00 | 0.00 | 0.166667 | 0.32 | 0.466667 | 0.666667 | 0.766667 | 1.226667 | 2.066667 | 2.733333 | 3.493333 |
2016 rows × 17 columns
occ_stats_summary = occ_stats.unstack()
arr_stats_summary = arr_stats.unstack()
dep_stats_summary = dep_stats.unstack()
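As an aside, you can get a similar table more directly with GroupBy.agg(), passing a list of function names and custom percentile helpers. This is just a sketch of the idea, not a drop-in replacement for the approach above (the p95 helper is a name I'm defining here):

# Alternative: agg() builds the stat columns directly, no unstack() needed
def p95(group):
    return group.quantile(0.95)

occ_agg = bydate_dfgrp2['occupancy'].agg(['count', 'mean', 'min', 'max', 'std', p95])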
# Let's peek into the middle of the table.
occ_stats_summary[200:220]
category | dayofweek | binofday | count | mean | min | max | stdev | p50 | p55 | p60 | p65 | p70 | p75 | p80 | p85 | p90 | p95 | p975 | p99
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
ART | 4.0 | 8.0 | 13.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
9.0 | 13.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ||
10.0 | 13.0 | 0.010256 | 0.000000 | 0.133333 | 0.036980 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.053333 | 0.093333 | 0.117333 | ||
11.0 | 13.0 | 0.325641 | 0.000000 | 1.600000 | 0.485179 | 0.000000 | 0.060000 | 0.180000 | 0.420000 | 0.553333 | 0.633333 | 0.653333 | 0.680000 | 0.720000 | 1.080000 | 1.340000 | 1.496000 | ||
12.0 | 13.0 | 1.800000 | 0.000000 | 4.500000 | 1.290923 | 1.966667 | 2.026667 | 2.146667 | 2.386667 | 2.493333 | 2.533333 | 2.713333 | 2.886667 | 3.046667 | 3.660000 | 4.080000 | 4.332000 | ||
13.0 | 13.0 | 5.143590 | 2.266667 | 7.966667 | 1.619007 | 4.933333 | 5.073333 | 5.226667 | 5.406667 | 5.840000 | 6.400000 | 6.480000 | 6.740000 | 7.360000 | 7.726667 | 7.846667 | 7.918667 | ||
14.0 | 13.0 | 8.497436 | 6.266667 | 11.166667 | 1.368661 | 7.966667 | 8.366667 | 8.733333 | 9.033333 | 9.226667 | 9.366667 | 9.466667 | 9.700000 | 10.200000 | 10.686667 | 10.926667 | 11.070667 | ||
15.0 | 13.0 | 8.707692 | 5.166667 | 10.733333 | 1.572987 | 9.200000 | 9.480000 | 9.666667 | 9.666667 | 9.680000 | 9.700000 | 9.860000 | 10.040000 | 10.260000 | 10.493333 | 10.613333 | 10.685333 | ||
16.0 | 13.0 | 6.046154 | 2.833333 | 8.666667 | 1.833815 | 6.533333 | 6.613333 | 6.680000 | 6.720000 | 6.880000 | 7.100000 | 7.480000 | 7.866667 | 8.266667 | 8.506667 | 8.586667 | 8.634667 | ||
17.0 | 13.0 | 6.279487 | 3.500000 | 9.066667 | 1.587074 | 6.333333 | 6.353333 | 6.480000 | 6.820000 | 6.960000 | 7.000000 | 7.060000 | 7.473333 | 8.593333 | 9.006667 | 9.036667 | 9.054667 | ||
18.0 | 13.0 | 6.310256 | 4.000000 | 8.800000 | 1.491729 | 6.500000 | 6.760000 | 6.966667 | 7.066667 | 7.220000 | 7.400000 | 7.620000 | 7.800000 | 7.900000 | 8.280000 | 8.540000 | 8.696000 | ||
19.0 | 13.0 | 6.074359 | 3.833333 | 7.566667 | 1.215955 | 5.966667 | 6.466667 | 6.840000 | 6.960000 | 7.053333 | 7.133333 | 7.193333 | 7.280000 | 7.420000 | 7.506667 | 7.536667 | 7.554667 | ||
20.0 | 13.0 | 5.784615 | 2.533333 | 9.500000 | 1.736142 | 5.533333 | 5.873333 | 6.153333 | 6.313333 | 6.473333 | 6.633333 | 6.753333 | 7.013333 | 7.553333 | 8.440000 | 8.970000 | 9.288000 | ||
21.0 | 13.0 | 6.271795 | 3.833333 | 9.900000 | 1.718303 | 5.700000 | 5.780000 | 5.860000 | 5.940000 | 6.566667 | 7.466667 | 7.646667 | 7.946667 | 8.486667 | 9.160000 | 9.530000 | 9.752000 | ||
22.0 | 13.0 | 6.343590 | 2.633333 | 9.000000 | 1.634092 | 6.633333 | 6.673333 | 6.793333 | 7.073333 | 7.246667 | 7.366667 | 7.526667 | 7.646667 | 7.686667 | 8.220000 | 8.610000 | 8.844000 | ||
23.0 | 13.0 | 5.692308 | 3.533333 | 9.100000 | 1.697337 | 5.100000 | 5.440000 | 5.806667 | 6.226667 | 6.660000 | 7.100000 | 7.300000 | 7.446667 | 7.486667 | 8.140000 | 8.620000 | 8.908000 | ||
24.0 | 13.0 | 5.592308 | 2.800000 | 8.900000 | 1.593921 | 5.433333 | 5.453333 | 5.566667 | 5.866667 | 6.113333 | 6.333333 | 6.633333 | 6.993333 | 7.473333 | 8.140000 | 8.520000 | 8.748000 | ||
25.0 | 13.0 | 5.282051 | 3.000000 | 8.566667 | 1.492674 | 5.166667 | 5.226667 | 5.340000 | 5.560000 | 5.753333 | 5.933333 | 6.173333 | 6.486667 | 6.946667 | 7.686667 | 8.126667 | 8.390667 | ||
26.0 | 13.0 | 4.779487 | 2.300000 | 8.333333 | 1.769627 | 4.800000 | 4.920000 | 5.106667 | 5.426667 | 5.533333 | 5.533333 | 6.013333 | 6.400000 | 6.600000 | 7.333333 | 7.833333 | 8.133333 | ||
27.0 | 13.0 | 4.817949 | 2.266667 | 8.066667 | 1.493295 | 4.833333 | 4.933333 | 5.073333 | 5.293333 | 5.393333 | 5.433333 | 5.653333 | 5.906667 | 6.226667 | 7.026667 | 7.546667 | 7.858667 |
Wouldn’t it be nice if Excel Pivot Tables could produce the output above? Why can’t they? Because they can’t do things like percentiles (or other custom aggregate functions). I love spreadsheets. I teach spreadsheet modeling. However, I find myself using either Python+pandas+matplotlib or R+plyr+ggplot2 more and more frequently for things I used to do in Excel.
Let’s fire these guys out to csv files so we can check them out and maybe play with them in a spreadsheet.
occ_stats_summary.to_csv('occ_stats_summary.csv')
arr_stats_summary.to_csv('arr_stats_summary.csv')
dep_stats_summary.to_csv('dep_stats_summary.csv')
The real reason I exported them to csv was to make it easy to read these results back in for Part 3 of this series of tutorials. In Part 3, we’ll create some plots using matplotlib based on these summary statistics.
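If you want to read one of these files back in with the grouping fields restored as the index (as we’ll do in Part 3), something like the following should work, assuming the default layout written by to_csv() above:

# Rebuild the (category, dayofweek, binofday) MultiIndex on read
occ_stats_summary = pd.read_csv('occ_stats_summary.csv',
                                index_col=['category', 'dayofweek', 'binofday'])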