Computing daily averages from transaction data using pandas can be tricky - Part 2

Now let’s add a grouping field

Categories: python, pandas, bikeshare

Author: Mark Isken

Published: July 18, 2018

In Part 1 of this post, we discussed the importance of taking into account days with zero activity, as well as carefully specifying the date range for analysis, when computing things like the average number of bike trips per day in a bike share system. Obviously, this applies more generally to computing the average number of events per day. We showed an approach to dealing with these complications.

In Part 2 of this post, we’ll extend this to cases in which we might want to group by an additional field (e.g. station_id for the bike share example) and see how this problem is just one part of the general problem of doing occupancy analysis based on transaction data. We’ll see how the hillmaker package can make these types of analyses easier.

import pandas as pd

# Read the trip data, parsing the timestamp columns as datetimes
trip = pd.read_csv('trip.csv', parse_dates=['starttime', 'stoptime'])
# Add a date-only column for grouping by day
trip['tripdate'] = trip['starttime'].map(lambda x: x.date())
trip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286857 entries, 0 to 286856
Data columns (total 13 columns):
trip_id              286857 non-null int64
starttime            286857 non-null datetime64[ns]
stoptime             286857 non-null datetime64[ns]
bikeid               286857 non-null object
tripduration         286857 non-null float64
from_station_name    286857 non-null object
to_station_name      286857 non-null object
from_station_id      286857 non-null object
to_station_id        286857 non-null object
usertype             286857 non-null object
gender               181557 non-null object
birthyear            181553 non-null float64
tripdate             286857 non-null object
dtypes: datetime64[ns](2), float64(2), int64(1), object(8)
memory usage: 28.5+ MB
trip.head()
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear tripdate
0 431 2014-10-13 10:31:00 2014-10-13 10:48:00 SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0 2014-10-13
1 432 2014-10-13 10:32:00 2014-10-13 10:48:00 SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0 2014-10-13
2 433 2014-10-13 10:33:00 2014-10-13 10:48:00 SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0 2014-10-13
3 434 2014-10-13 10:34:00 2014-10-13 10:48:00 SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0 2014-10-13
4 435 2014-10-13 10:34:00 2014-10-13 10:49:00 SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0 2014-10-13
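
As an aside, the tripdate column above was created with map and a lambda; the vectorized dt accessor is an equivalent and slightly more idiomatic spelling:

# Equivalent, vectorized way to extract the date part of each timestamp
trip['tripdate'] = trip['starttime'].dt.date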

What is the average number of bike rentals per day from each station?

Not only do we need to take into account days on which no bikes were rented from a given station, we also have to decide how to handle the fact that not all stations were open for the entire period of time represented in the trip dataframe.

print("The first trip was at {}.".format(trip.starttime.min()))
print("The last trip was at {}.".format(trip.starttime.max()))
The first trip was at 2014-10-13 10:31:00.
The last trip was at 2016-08-31 23:49:00.
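
That span covers 689 calendar days, a number worth computing up front since it will turn up again as the expected day count for any station open the entire period:

# Inclusive number of calendar days between the first and last trip
num_days = (trip.starttime.max().normalize() - trip.starttime.min().normalize()).days + 1
print(num_days)
689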

We can get a sense of the timeframes during which individual stations were open by looking at the min and max of the starttime field.

trip.groupby('from_station_id')['starttime'].aggregate(['min', 'max'])
min max
from_station_id
8D OPS 02 2016-08-12 09:03:00 2016-08-12 09:06:00
BT-01 2014-10-13 15:39:00 2016-08-31 22:13:00
BT-03 2014-10-13 12:35:00 2016-08-31 17:21:00
BT-04 2014-10-13 12:05:00 2016-08-31 19:31:00
BT-05 2014-10-13 13:08:00 2016-08-31 18:21:00
CBD-03 2014-10-13 13:30:00 2016-08-31 19:36:00
CBD-04 2015-07-27 15:52:00 2016-08-31 20:44:00
CBD-05 2014-10-13 11:52:00 2016-08-31 17:26:00
CBD-06 2014-10-13 10:31:00 2016-08-31 20:32:00
CBD-07 2014-10-13 11:49:00 2016-08-31 18:23:00
CBD-13 2014-10-13 12:58:00 2016-08-31 19:58:00
CD-01 2015-05-23 08:01:00 2016-08-09 18:43:00
CH-01 2014-10-13 16:31:00 2016-08-31 22:37:00
CH-02 2014-10-13 11:54:00 2016-08-31 21:59:00
CH-03 2014-10-13 17:16:00 2016-08-31 20:33:00
CH-05 2014-10-13 13:45:00 2016-08-31 20:43:00
CH-06 2014-10-13 15:33:00 2016-08-31 15:44:00
CH-07 2014-10-13 14:06:00 2016-08-31 19:27:00
CH-08 2014-10-13 13:32:00 2016-08-31 23:49:00
CH-09 2014-10-13 13:01:00 2016-08-31 23:34:00
CH-12 2014-10-13 17:51:00 2016-08-31 18:30:00
CH-15 2014-10-13 13:08:00 2016-08-31 20:57:00
CH-16 2016-03-19 01:06:00 2016-08-31 17:38:00
DPD-01 2014-10-13 11:59:00 2016-08-31 18:57:00
DPD-03 2014-10-13 13:43:00 2016-08-31 19:32:00
EL-01 2014-10-13 15:03:00 2016-08-31 19:03:00
EL-03 2014-10-13 18:19:00 2016-08-31 17:35:00
EL-05 2014-10-13 15:05:00 2016-08-30 18:23:00
FH-01 2014-10-13 12:59:00 2016-03-18 17:19:00
FH-04 2014-10-13 18:44:00 2016-08-31 20:58:00
... ... ...
PS-04 2014-10-13 11:35:00 2016-08-31 18:10:00
PS-05 2014-10-13 11:51:00 2016-08-31 16:47:00
Pronto shop 2015-06-02 20:36:00 2015-06-02 20:36:00
Pronto shop 2 2016-05-17 12:03:00 2016-05-17 15:41:00
SLU-01 2014-10-13 13:09:00 2016-08-31 18:54:00
SLU-02 2014-10-13 13:51:00 2016-08-31 18:02:00
SLU-04 2014-10-13 15:34:00 2016-08-31 21:27:00
SLU-07 2014-10-13 19:11:00 2016-08-31 18:49:00
SLU-15 2014-10-13 13:02:00 2016-08-31 18:57:00
SLU-16 2014-10-13 20:57:00 2016-08-31 16:10:00
SLU-17 2014-10-13 14:53:00 2016-08-31 19:17:00
SLU-18 2014-10-13 13:02:00 2016-07-02 11:09:00
SLU-19 2014-10-13 13:38:00 2016-08-31 22:02:00
SLU-20 2015-06-12 15:13:00 2016-08-31 18:36:00
SLU-21 2015-09-19 16:24:00 2016-08-30 18:41:00
SLU-22 2016-07-03 19:35:00 2016-08-31 14:20:00
UD-01 2014-10-13 12:02:00 2016-08-31 17:32:00
UD-02 2014-10-13 12:48:00 2016-08-31 08:07:00
UD-04 2014-10-13 12:54:00 2016-08-31 09:24:00
UD-07 2014-10-13 13:20:00 2016-08-31 08:48:00
UW-01 2014-10-14 16:35:00 2015-10-27 17:26:00
UW-02 2014-10-13 12:15:00 2016-08-31 15:08:00
UW-04 2014-10-13 14:49:00 2016-08-31 18:43:00
UW-06 2014-10-13 14:21:00 2016-08-30 15:26:00
UW-07 2014-10-13 12:37:00 2016-08-30 20:44:00
UW-10 2014-10-13 13:04:00 2016-08-30 13:50:00
UW-11 2015-11-05 16:28:00 2016-08-31 21:35:00
WF-01 2014-10-13 13:45:00 2016-08-31 22:47:00
WF-03 2016-08-10 10:36:00 2016-08-31 17:52:00
WF-04 2014-10-13 12:04:00 2016-08-31 22:10:00

61 rows × 2 columns

Let’s look at the station.csv file so that we can see the installation and decommission dates.

station = pd.read_csv('station.csv', 
                      parse_dates = ['install_date', 'modification_date', 'decommission_date'])
station.head(15)
station_id name lat long install_date install_dockcount modification_date current_dockcount decommission_date
0 BT-01 3rd Ave & Broad St 47.618418 -122.350964 2014-10-13 18 NaT 18 NaT
1 BT-03 2nd Ave & Vine St 47.615829 -122.348564 2014-10-13 16 NaT 16 NaT
2 BT-04 6th Ave & Blanchard St 47.616094 -122.341102 2014-10-13 16 NaT 16 NaT
3 BT-05 2nd Ave & Blanchard St 47.613110 -122.344208 2014-10-13 14 NaT 14 NaT
4 CBD-03 7th Ave & Union St 47.610731 -122.332447 2014-10-13 20 NaT 20 NaT
5 CBD-04 Union St & 4th Ave 47.609221 -122.335596 2015-07-27 18 NaT 18 NaT
6 CBD-05 1st Ave & Marion St 47.604058 -122.335800 2014-10-13 20 NaT 20 NaT
7 CBD-06 2nd Ave & Spring St 47.605950 -122.335768 2014-10-13 20 2015-11-09 18 NaT
8 CBD-07 City Hall / 4th Ave & James St 47.603509 -122.330409 2014-10-13 20 NaT 20 NaT
9 CBD-13 2nd Ave & Pine St 47.610185 -122.339641 2014-10-13 18 NaT 18 NaT
10 CD-01 12th Ave & E Yesler Way 47.602103 -122.316923 2015-05-22 16 2016-08-09 0 2016-08-09
11 CH-01 Summit Ave & E Denny Way 47.618633 -122.325249 2014-10-13 16 NaT 16 NaT
12 CH-02 E Harrison St & Broadway Ave E 47.622063 -122.321251 2014-10-13 18 2015-02-24 20 NaT
13 CH-03 Summit Ave E & E Republican St 47.623367 -122.325279 2014-10-13 16 NaT 16 NaT
14 CH-05 15th Ave E & E Thomas St 47.620712 -122.312805 2014-10-13 16 NaT 16 NaT

To simplify things, let’s only consider trips from stations that have not been decommissioned.

# Create a list of stations to consider for the analysis
active_stations = list(station[station.decommission_date.isna()]['station_id'])
# Only keep records for active stations
trip = trip[trip.from_station_id.isin(active_stations)]

Now let’s create a DataFrame containing the analysis start and end dates for each active station. For the start date we’ll use the install_date field and for the end date we’ll use the last date in the trip data, 2016-08-31.

# Just grab the rows of active stations
station_dates = station[station.decommission_date.isna()]
# Just grab columns of interest
station_dates = station_dates[['station_id','install_date','decommission_date']]
# Rename the date columns
station_dates.rename(columns={'install_date': 'start_date', 'decommission_date': 'end_date'},
                    inplace=True)
# Set end date for all stations
station_dates.end_date = pd.Timestamp('2016-08-31')
# Check out the result
station_dates.head(7)
station_id start_date end_date
0 BT-01 2014-10-13 2016-08-31
1 BT-03 2014-10-13 2016-08-31
2 BT-04 2014-10-13 2016-08-31
3 BT-05 2014-10-13 2016-08-31
4 CBD-03 2014-10-13 2016-08-31
5 CBD-04 2015-07-27 2016-08-31
6 CBD-05 2014-10-13 2016-08-31
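
As a quick sanity check (a sketch; output not shown), each active station's first observed trip should fall on or after its start_date:

# First trip date observed for each active station
first_trip = trip.groupby('from_station_id')['tripdate'].min()
# Align with the analysis date ranges; no first trip should precede installation
check = station_dates.set_index('station_id').join(first_trip)
print((pd.to_datetime(check['tripdate']) >= check['start_date']).all())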

In Part 1 we outlined the basic strategy for computing the average number of daily events when some days might not have any events. A minimal code sketch of the ungrouped version appears after the list.

  1. Create a range of dates based on the start and end date of the time period of interest.
  2. Create an empty DataFrame using the range of dates as the index. For convenience, add a weekday column based on the date to facilitate day-of-week analysis. Let’s call this new DataFrame the “seeded” DataFrame.
  3. Use groupby on the original trip DataFrame to compute the number of trips by date and store the result as a DataFrame.
  4. Merge the two DataFrames on their indexes (tripdate), but do a “left join”. The pandas merge function is perfect for this.
  5. If there are dates with no trips, they’ll have missing data in the new merged DataFrame. Update the missing values to 0 using the fillna function in pandas.
  6. Now you can compute overall mean number of trips per day. You could also compute means by day of week.
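
Here’s that strategy as a minimal, runnable sketch for the ungrouped case (all stations pooled); the variable names are illustrative rather than copied from Part 1:

# Sketch of the six steps for the ungrouped case
date_rng = pd.date_range('2014-10-13', '2016-08-31')                            # Step 1
seeded = pd.DataFrame(index=date_rng)                                           # Step 2
seeded['weekday'] = seeded.index.map(lambda d: d.weekday())
counts = pd.DataFrame(trip.groupby('tripdate').size(), columns=['num_trips'])   # Step 3
counts.index = pd.to_datetime(counts.index)  # align date objects with Timestamps
merged = pd.merge(seeded, counts, how='left',
                  left_index=True, right_index=True)                            # Step 4
merged['num_trips'] = merged['num_trips'].fillna(0)                             # Step 5
print(merged['num_trips'].mean())                                               # Step 6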

In order to compute the average number of trips per day by station, a few details in the above process have to be generalized.

  • In Step 1, the start and end dates can depend on the station.
  • In Step 2, we need a DataFrame containing each station and its associated date range.
  • In Step 3, we need to group by both station and trip date.

Accounting for zero days and station-specific analysis timeframes

Steps 1 and 2: Create an empty DataFrame using the stations and their associated date ranges as the MultiIndex.

We’ll construct the desired MultiIndex from a list of tuples of station ids and dates. To create the list of tuples we can use a list comprehension that iterates over the rows of the station_dates dataframe, generating the associated date range for each station. Gotta love list comprehensions.

index_tuples = [(row.station_id, d) for index, row in station_dates.iterrows() 
                         for d in pd.date_range(row.start_date, row.end_date)]

print(index_tuples[:5])
[('BT-01', Timestamp('2014-10-13 00:00:00', freq='D')), ('BT-01', Timestamp('2014-10-14 00:00:00', freq='D')), ('BT-01', Timestamp('2014-10-15 00:00:00', freq='D')), ('BT-01', Timestamp('2014-10-16 00:00:00', freq='D')), ('BT-01', Timestamp('2014-10-17 00:00:00', freq='D'))]

Create the empty DataFrame and add a weekday column for convenience.

trips_by_station_date_seeded = pd.DataFrame(index=pd.MultiIndex.from_tuples(index_tuples))
# Add weekday column to new dataframe
trips_by_station_date_seeded['weekday'] = \
    trips_by_station_date_seeded.index.get_level_values(1).map(lambda x: x.weekday())

trips_by_station_date_seeded.index.names = ['from_station_id', 'tripdate']
print(trips_by_station_date_seeded.head())
print(trips_by_station_date_seeded.tail())
                            weekday
from_station_id tripdate           
BT-01           2014-10-13        0
                2014-10-14        1
                2014-10-15        2
                2014-10-16        3
                2014-10-17        4
                            weekday
from_station_id tripdate           
WF-03           2016-08-27        5
                2016-08-28        6
                2016-08-29        0
                2016-08-30        1
                2016-08-31        2
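
As an aside, the same seeded frame can be built without an explicit tuple list by concatenating one small frame of dates per station; a sketch:

# Alternative construction: one frame of dates per station, then concatenate
seeded_alt = pd.concat(
    [pd.DataFrame({'from_station_id': row.station_id,
                   'tripdate': pd.date_range(row.start_date, row.end_date)})
     for row in station_dates.itertuples()]
).set_index(['from_station_id', 'tripdate'])
seeded_alt['weekday'] = seeded_alt.index.get_level_values('tripdate').map(lambda x: x.weekday())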

Step 3: Use groupby on the original trip DataFrame to compute number of trips by station by date

# Create a Group by object using from_station_id and tripdate
grp_station_date = trip.groupby(['from_station_id', 'tripdate'])

# Compute number of trips by date and check out the result
trips_by_station_date = pd.DataFrame(grp_station_date.size(), columns=['num_trips'])
print(trips_by_station_date.head())
                            num_trips
from_station_id tripdate             
BT-01           2014-10-13         20
                2014-10-14         28
                2014-10-15          8
                2014-10-16         24
                2014-10-17         16

Step 4: Merge the two DataFrames on their indexes but do a “left join”.

The pandas merge function is perfect for this. The left_index=True and right_index=True arguments tell pandas to use the respective indexes as the join keys. Check out the documentation for merge().

# Merge the two dataframes doing a left join (with seeded table on left)
trips_by_station_date_merged = pd.merge(trips_by_station_date_seeded, 
                                        trips_by_station_date, how='left', 
                                        left_index=True, right_index=True, sort=True)
trips_by_station_date_merged[100:125]
weekday num_trips
from_station_id tripdate
BT-01 2015-01-21 2 2.0
2015-01-22 3 8.0
2015-01-23 4 12.0
2015-01-24 5 28.0
2015-01-25 6 20.0
2015-01-26 0 12.0
2015-01-27 1 12.0
2015-01-28 2 10.0
2015-01-29 3 14.0
2015-01-30 4 10.0
2015-01-31 5 14.0
2015-02-01 6 2.0
2015-02-02 0 2.0
2015-02-03 1 6.0
2015-02-04 2 8.0
2015-02-05 3 2.0
2015-02-06 4 NaN
2015-02-07 5 14.0
2015-02-08 6 14.0
2015-02-09 0 10.0
2015-02-10 1 8.0
2015-02-11 2 10.0
2015-02-12 3 18.0
2015-02-13 4 30.0
2015-02-14 5 52.0

Step 5: Replace missing values with zeroes for those dates with no trips

If there are dates with no trips, they’ll have missing data in the new merged DataFrame. Update the missing values to 0 using the fillna function in pandas. Remember, these are the “zero activity” days that, if ignored, would bias the estimated average number of daily trips upward.

# How many missing values do we have?
trips_by_station_date_merged['num_trips'].isna().sum()
3072
# Fill in any missing values with 0.
trips_by_station_date_merged['num_trips'] = trips_by_station_date_merged['num_trips'].fillna(0)
trips_by_station_date_merged[100:125]
weekday num_trips
from_station_id tripdate
BT-01 2015-01-21 2 2.0
2015-01-22 3 8.0
2015-01-23 4 12.0
2015-01-24 5 28.0
2015-01-25 6 20.0
2015-01-26 0 12.0
2015-01-27 1 12.0
2015-01-28 2 10.0
2015-01-29 3 14.0
2015-01-30 4 10.0
2015-01-31 5 14.0
2015-02-01 6 2.0
2015-02-02 0 2.0
2015-02-03 1 6.0
2015-02-04 2 8.0
2015-02-05 3 2.0
2015-02-06 4 0.0
2015-02-07 5 14.0
2015-02-08 6 14.0
2015-02-09 0 10.0
2015-02-10 1 8.0
2015-02-11 2 10.0
2015-02-12 3 18.0
2015-02-13 4 30.0
2015-02-14 5 52.0
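
Worth noting: steps 4 and 5 can be collapsed into a single reindex call, which conforms the counts to the seeded index and fills the holes with zero in one shot. This is a sketch; the date level of the counts index holds datetime.date objects while the seeded index holds Timestamps, so we convert first so the alignment works:

# Alternative to merge + fillna: reindex the counts onto the seeded MultiIndex
seeded_index = pd.MultiIndex.from_tuples(index_tuples,
                                         names=['from_station_id', 'tripdate'])
counts = trips_by_station_date.copy()
counts.index = counts.index.set_levels(
    pd.to_datetime(counts.index.levels[1]), level='tripdate')
trips_alt = counts.reindex(seeded_index, fill_value=0)

The weekday column can then be added exactly as before.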

Step 6: Compute statistics of interest

Now we can safely compute means and other statistics of interest for the num_trips column after grouping by from_station_id. Since we have taken the zero-activity days into account, the stations with count values less than 689 (the number of days from 2014-10-13 through 2016-08-31, inclusive) are those installed sometime after 2014-10-13 (the installation date is in the station table).

trips_by_station_date_merged.groupby('from_station_id')['num_trips'].describe()
count mean std min 25% 50% 75% max
from_station_id
BT-01 689.0 15.869376 10.729919 0.0 8.0 14.0 22.00 62.0
BT-03 689.0 11.854862 7.494522 0.0 6.0 10.0 16.00 52.0
BT-04 689.0 6.798258 4.590017 0.0 3.0 6.0 10.00 26.0
BT-05 689.0 9.100145 6.731142 0.0 4.0 8.0 13.00 42.0
CBD-03 689.0 8.062409 7.649699 0.0 3.0 6.0 10.00 57.0
CBD-04 402.0 6.699005 4.044817 0.0 4.0 7.0 9.00 22.0
CBD-05 689.0 8.165457 5.902446 0.0 4.0 7.0 12.00 50.0
CBD-06 689.0 7.930334 6.506778 0.0 3.0 7.0 11.00 52.0
CBD-07 689.0 5.271408 5.189189 0.0 2.0 4.0 8.00 72.0
CBD-13 689.0 14.584906 9.699203 0.0 7.0 13.0 21.00 82.0
CH-01 689.0 9.822932 5.693529 0.0 6.0 9.0 12.00 38.0
CH-02 689.0 13.989840 7.987720 0.0 8.0 13.0 18.00 48.0
CH-03 689.0 9.544267 5.981427 0.0 5.0 8.0 13.00 36.0
CH-05 689.0 11.146589 6.491743 0.0 7.0 10.0 15.00 40.0
CH-06 689.0 5.568940 3.525442 0.0 3.0 5.0 8.00 26.0
CH-07 689.0 16.534107 7.045270 0.0 12.0 16.0 21.00 40.0
CH-08 689.0 13.741655 8.355998 0.0 8.0 12.0 19.00 68.0
CH-09 689.0 8.269956 5.721211 0.0 4.0 7.0 11.00 54.0
CH-12 689.0 8.052250 5.954298 0.0 4.0 7.0 12.00 34.0
CH-15 689.0 9.175617 5.059781 0.0 6.0 8.0 12.00 34.0
CH-16 167.0 7.712575 3.112551 0.0 6.0 8.0 10.00 16.0
DPD-01 689.0 7.490566 5.212656 0.0 4.0 7.0 10.00 32.0
DPD-03 689.0 2.404935 2.601996 0.0 0.0 2.0 4.00 18.0
EL-01 689.0 5.195936 4.585833 0.0 2.0 4.0 7.00 29.0
EL-03 689.0 8.341074 5.414082 0.0 5.0 8.0 11.00 38.0
EL-05 689.0 5.290276 5.603926 0.0 2.0 4.0 7.00 48.0
FH-04 689.0 5.917271 3.751411 0.0 3.0 6.0 8.00 22.0
ID-04 689.0 3.963716 3.996745 0.0 1.0 3.0 6.00 24.0
PS-04 689.0 8.195936 8.135708 0.0 4.0 7.0 11.00 142.0
PS-05 689.0 5.613933 4.158087 0.0 3.0 5.0 8.00 30.0
SLU-01 689.0 12.165457 9.145079 0.0 5.0 10.0 17.00 50.0
SLU-02 689.0 10.068215 5.630042 0.0 6.0 10.0 14.00 32.0
SLU-04 689.0 8.233672 6.192806 0.0 4.0 7.0 12.00 48.0
SLU-07 689.0 10.458636 7.894570 0.0 4.0 8.0 15.00 42.0
SLU-15 689.0 14.505080 8.536365 0.0 8.0 14.0 20.00 48.0
SLU-16 689.0 8.795356 8.992095 0.0 3.0 6.0 12.00 51.0
SLU-17 689.0 8.238026 7.888482 0.0 2.0 6.0 12.00 68.0
SLU-19 689.0 10.865022 6.683941 0.0 6.0 10.0 15.00 36.0
SLU-20 447.0 4.774049 3.943030 0.0 2.0 4.0 7.00 22.0
SLU-21 352.0 2.451705 2.590350 0.0 1.0 2.0 4.00 20.0
SLU-22 60.0 12.683333 6.215049 4.0 9.0 11.0 16.25 28.0
UD-01 689.0 5.492017 4.780381 0.0 2.0 4.0 8.00 30.0
UD-02 689.0 2.081277 2.709117 0.0 0.0 1.0 3.00 16.0
UD-04 689.0 5.268505 4.673858 0.0 2.0 4.0 8.00 30.0
UD-07 689.0 3.708273 3.380511 0.0 1.0 3.0 5.00 26.0
UW-02 689.0 3.001451 3.168476 0.0 1.0 2.0 4.00 22.0
UW-04 689.0 4.065312 3.487949 0.0 2.0 3.0 6.00 26.0
UW-06 689.0 3.902758 4.051716 0.0 1.0 3.0 6.00 26.0
UW-07 689.0 2.682148 3.015799 0.0 0.0 2.0 4.00 24.0
UW-10 689.0 2.029028 2.601827 0.0 0.0 1.0 3.00 18.0
UW-11 308.0 3.207792 3.306642 0.0 0.0 2.0 5.25 16.0
WF-01 689.0 18.946299 14.597358 0.0 7.0 16.0 28.00 110.0
WF-03 23.0 6.956522 3.831428 0.0 4.0 6.0 9.50 14.0
WF-04 689.0 9.226415 8.179196 0.0 3.0 8.0 13.00 60.0

We can even break this out by day of week.

trips_by_station_date_merged.groupby(['from_station_id', 'weekday'])['num_trips'].describe()
count mean std min 25% 50% 75% max
from_station_id weekday
BT-01 0 99.0 15.282828 9.574669 2.0 8.00 13.0 20.00 56.0
1 99.0 13.949495 8.374370 2.0 8.00 12.0 19.50 38.0
2 99.0 13.525253 7.661650 2.0 8.00 12.0 18.00 39.0
3 98.0 15.285714 10.632570 0.0 7.25 13.0 18.00 62.0
4 98.0 17.377551 11.544191 0.0 9.25 16.0 21.75 61.0
5 98.0 20.632653 14.058338 0.0 8.25 18.0 30.75 58.0
6 98.0 15.081633 10.674832 0.0 6.00 14.5 22.00 46.0
BT-03 0 99.0 11.676768 6.948540 1.0 7.00 11.0 16.00 38.0
1 99.0 12.151515 6.153419 2.0 8.00 11.0 16.00 30.0
2 99.0 12.676768 7.296670 1.0 7.00 12.0 17.50 32.0
3 98.0 12.336735 7.303359 0.0 6.25 11.5 16.00 40.0
4 98.0 12.755102 7.065323 0.0 7.00 12.0 18.00 34.0
5 98.0 10.877551 8.820390 0.0 4.00 8.5 15.00 42.0
6 98.0 10.500000 8.474644 0.0 5.00 9.0 14.00 52.0
BT-04 0 99.0 7.858586 5.076979 0.0 4.00 7.0 10.00 26.0
1 99.0 7.878788 4.291122 1.0 5.00 7.0 10.00 24.0
2 99.0 7.919192 4.006823 0.0 5.00 8.0 10.00 20.0
3 98.0 7.816327 4.782586 0.0 4.00 7.0 11.00 24.0
4 98.0 7.755102 4.240359 0.0 5.00 7.0 10.00 20.0
5 98.0 4.346939 3.964019 0.0 2.00 4.0 6.00 23.0
6 98.0 3.979592 3.508398 0.0 1.00 3.0 6.00 16.0
BT-05 0 99.0 9.454545 6.429004 1.0 4.50 8.0 13.50 32.0
1 99.0 9.030303 6.333350 1.0 4.00 8.0 13.00 28.0
2 99.0 9.040404 6.626969 0.0 4.00 7.0 12.50 34.0
3 98.0 9.540816 7.211343 0.0 4.00 8.0 14.00 30.0
4 98.0 10.255102 6.709451 0.0 5.00 9.0 14.00 42.0
5 98.0 9.000000 7.550517 0.0 3.00 8.0 13.00 42.0
6 98.0 7.377551 5.999167 0.0 3.00 7.0 10.00 28.0
CBD-03 0 99.0 8.585859 8.279567 0.0 4.00 6.0 10.00 57.0
1 99.0 8.393939 6.589754 0.0 4.00 6.0 12.00 34.0
... ... ... ... ... ... ... ... ... ...
UW-10 5 98.0 2.377551 3.131204 0.0 0.00 1.5 3.75 18.0
6 98.0 1.775510 3.000105 0.0 0.00 0.0 2.00 16.0
UW-11 0 44.0 3.272727 3.142830 0.0 1.00 3.0 5.00 11.0
1 44.0 3.181818 3.074824 0.0 0.00 2.0 6.00 10.0
2 44.0 3.431818 3.030064 0.0 0.00 3.0 6.00 11.0
3 44.0 3.772727 3.469895 0.0 1.00 2.5 6.00 16.0
4 44.0 3.681818 3.489337 0.0 0.75 3.0 6.00 13.0
5 44.0 2.636364 3.424202 0.0 0.00 1.0 3.25 12.0
6 44.0 2.477273 3.480769 0.0 0.00 1.0 3.00 15.0
WF-01 0 99.0 17.212121 11.908900 0.0 7.00 16.0 26.00 56.0
1 99.0 16.222222 10.312025 2.0 8.00 14.0 21.50 56.0
2 99.0 15.616162 10.238689 2.0 7.00 13.0 22.00 44.0
3 98.0 18.438776 11.849630 2.0 9.00 17.5 26.00 66.0
4 98.0 17.979592 13.641945 0.0 6.00 14.0 28.00 53.0
5 98.0 24.530612 19.283669 0.0 8.00 22.0 37.00 74.0
6 98.0 22.704082 19.682641 0.0 5.25 20.0 34.75 110.0
WF-03 0 3.0 4.000000 1.000000 3.0 3.50 4.0 4.50 5.0
1 4.0 5.000000 4.690416 0.0 2.25 4.5 7.25 11.0
2 4.0 6.250000 2.872281 4.0 4.00 5.5 7.75 10.0
3 3.0 8.333333 4.041452 4.0 6.50 9.0 10.50 12.0
4 3.0 10.333333 3.214550 8.0 8.50 9.0 11.50 14.0
5 3.0 7.333333 5.859465 3.0 4.00 5.0 9.50 14.0
6 3.0 8.333333 3.055050 5.0 7.00 9.0 10.00 11.0
WF-04 0 99.0 8.383838 6.974746 0.0 3.00 6.0 12.00 32.0
1 99.0 7.525253 6.221004 0.0 3.00 6.0 11.00 32.0
2 99.0 6.888889 6.057360 0.0 2.00 6.0 10.00 29.0
3 98.0 8.163265 6.818250 0.0 4.00 6.5 12.00 36.0
4 98.0 8.765306 6.601782 0.0 4.00 7.5 11.00 29.0
5 98.0 13.326531 10.597389 0.0 4.00 12.0 19.00 48.0
6 98.0 11.581633 10.658643 0.0 2.25 9.5 18.00 60.0

378 rows × 8 columns
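
The long format above is handy for further grouping, but for eyeballing day-of-week patterns it can be nicer to pivot the means into a station-by-weekday table with unstack:

# Mean daily trips by station (rows) and day of week (columns)
dow_means = (trips_by_station_date_merged
             .groupby(['from_station_id', 'weekday'])['num_trips']
             .mean()
             .unstack('weekday'))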

A similar example using R - airline flights

Here’s how you can do the same thing using R. I’ll use the nycflights13 package and we’ll compute the average number of flights by day of week by carrier.

In order to use R magic within Jupyter notebook cells, you need to:

pip install rpy2

The latest version (2.9.4 as of this post) requires the tzlocal module as well.

pip install tzlocal

Now you can load the rpy2.ipython module via a magic command.

%load_ext rpy2.ipython

Now just put %%R at the top of any cell you want to treat as an R code chunk.

%%R
library(dplyr, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(lubridate, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(ggplot2, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)
library(nycflights13, quietly = TRUE, warn.conflicts = FALSE, verbose = FALSE)

Load the NYC Flights data package and add a flight date field.

%%R
flights <- nycflights13::flights
flights$fl_date <- with(flights, ISOdate(year, month, day))

Count the number of flights by date by carrier.

%%R
flights_by_date_carrier_1 <- flights %>%
  group_by(carrier, fl_date) %>%
  summarize(numflights = n())

We need to make sure that we have a data frame containing every combination of fl_date and carrier so that days with zero flights are properly accounted for when computing the average number of flights by day of week. R has a function called expand.grid which is perfect for this. I’ll create two vectors, one containing all the dates and one containing all the carriers. Then expand.grid will create a dataframe with all combinations of the elements of the two vectors.

%%R
dates <- seq(ymd('2013-01-01'),ymd('2013-12-31'),by='days')
carriers <- unique(flights$carrier)
date_carrier <- expand.grid(carriers, dates)
names(date_carrier) <- c("carrier", "fl_date")

Now we can merge the two dataframes using a left join. Carrier-date combinations with no flights come through the join with missing numflights values, so we replace those NAs with zeroes; this plays the same role as the fillna step in the pandas version.

%%R
flights_by_date_carrier <- merge(x = date_carrier, y = flights_by_date_carrier_1,
                      by.x = c("carrier", "fl_date"),
                      by.y = c("carrier", "fl_date"),
                      all.x = TRUE)
# Replace NA counts (zero-flight days) with 0 so they count toward the averages
flights_by_date_carrier$numflights[is.na(flights_by_date_carrier$numflights)] <- 0

Add a day of week column for convenience.

%%R
flights_by_date_carrier$weekday <- wday(flights_by_date_carrier$fl_date)

Finally, compute the average number of flights.

%%R
flights_carrier_dow <- flights_by_date_carrier %>%
  group_by(carrier, weekday) %>%
  summarize(
    avg_numflights = mean(numflights)
  )

Let’s plot it.

%%R
ggplot(data = flights_carrier_dow) +
  geom_bar(aes(x = factor(weekday, labels = c("Su","Mo","Tu","We","Th","Fr","Sa")),
               y = avg_numflights),
           stat = "identity") +
  facet_wrap(~carrier) +
  xlab("Day of Week") +
  ylab("Avg # of Flights") +
  ggtitle("Avg # Flights by Day of Week and Carrier")

Closing remarks - check out hillmaker

In addition to simply counting the number of bike rentals (or flights), we might be interested in related statistics such as the average and percentiles of the number of bikes in use by time of day and day of week. My hillmaker package is designed for just this sort of analysis. You can find an example here of it being used to analyze the Pronto Cycle Share data.

Citation

BibTeX citation:
@online{isken2018,
  author = {Mark Isken},
  title = {Computing Daily Averages from Transaction Data Using Pandas
    Can Be Tricky - {Part} 2},
  date = {2018-07-18},
  langid = {en}
}
For attribution, please cite this work as:
Mark Isken. 2018. “Computing Daily Averages from Transaction Data Using Pandas Can Be Tricky - Part 2.” July 18, 2018.