A cautionary tale for using LLMs for business analytics

Revisiting the bikeshare analysis problem

python
pandas
bikeshare
llm
ai
Author

Mark Isken

Published

August 5, 2025

By now it has become passé to show examples of large language models (LLMs) making errors. We know they make errors, often called hallucinations, and there is some evidence that recent models have actually gotten worse in this respect. Yet, there is one particular case I want to highlight here. I have been exploring this error since ChatGPT came on the scene in November 2022, and it raises a number of issues regarding the use of LLMs for teaching, learning, and doing business analytics.

The bike share analysis problem

Bike share systems are common in large cities. Bicycles are available for rental from numerous geographically dispersed stations. Riders can return their bike to any station leading to challenging and interesting bike inventory rebalancing and capacity planning problems. These systems produce a large volume of transaction data that can be analyzed to aid such decision making problems (Freund, Henderson, and Shmoys 2022; Haider et al. 2018; Schuijbroek, Hampshire, and Van Hoeve 2017; Jian et al. 2016).

For each rental, we know where the rental originated and where the bike was returned. Timestamps for both of these events make it possible to do detailed analysis of rental volume patterns. System users might be classified as either a member or a short-term pass holder, and analysis may reveal very different rental patterns for these two user classes.

Kaggle has several samples of bike share data including data from Seattle (https://www.kaggle.com/datasets/pronto/cycle-share-dataset) and Washington D.C. (https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset). We will be using the Seattle dataset for this example.

The three main data tables are the trip, station and weather tables. The trip table contains one row per bike rental. The field names, data types and example data are shown in the table below.

Trip table dictionary

field               type             description                          example
trip_id             integer          unique trip identifier               4242
starttime           string datetime  bike rented from source              “10/20/2014 17:43”
stoptime            string datetime  bike returned to destination         “10/20/2014 17:56”
bike_id             string           unique bike identifier               “SEA00255”
trip_duration       float            time bike rented in seconds          792.294
from_station_name   string           common name of source station        “3rd Ave & Broad St”
to_station_name     string           common name of destination station   “Dexter Ave N & Aloha St”
from_station_id     string           source station identifier            “BT-01”
to_station_id       string           destination station identifier       “SLU-02”
usertype            string           member or short-term pass holder     “Member”

In the Seattle version of the dataset available on Kaggle, there are 286,857 records in the trip table spanning a period of 689 days (about 98 weeks). On average, there are a little over 2,900 rentals per week.

I have used this dataset in my spreadsheet modeling class for an assignment on exploratory data analysis. Students are presented with a scenario in which they play the role of a consultant tasked with answering several analysis questions and creating a presentation for a new manager of the bike share system. For example:

  1. How does the volume of rides vary over time?
  2. Which stations are most popular to rent from? Which stations are popular destinations? Which source and destination pairs are most popular?
  3. How does rental volume vary by time of day and day of week?
  4. What does the distribution of ride duration look like?

Each of these questions can be further refined by seeing how the results vary for Members and Short-term Pass Holders. Using the weather data, we can see if and how different weather conditions impact rental volume and ride duration.

I have also used this dataset in my Practical Computing for Data Analytics course for modules and assignments on EDA and predictive modeling using R and Python. One of the questions I usually pose is to ask for the average number of bikes rented by day of week. Seems simple enough, but there are a few subtleties that make this a little trickier than it first appears. These subtleties become even more important for more complex variants of this question such as asking for the mean and 95th percentile of the number of bikes rented by day of week and hour of day for a specific station. We will elaborate on these issues in subsequent sections. I also use airline data such as the well-known nycflights13 R package and ask similar questions with respect to the number of flights by day of week for certain airlines.

When ChatGPT was released, one of the first things I tried was to ask it to generate Python code to answer these questions involving average rentals by day of week - spoiler alert: the original results were abysmal. Over the past few years as LLMs have improved and more models have been made available, I have retried these questions on a few different LLMs and think that the results are illuminating both with respect to the advances made as well as the problems that still remain. The results are quite relevant for those of us trying to prepare the next generation of data analysts and data scientists.

The first attempt

In April of 2023, I noticed that DataCamp, a popular online learning hub for data science, had made an AI Assistant available which was powered by the LLMs created by OpenAI. I was able to upload the trip.csv file into a workspace and then ask questions directly to the AI Assistant.

Prompt

Import the trip.csv file into a pandas dataframe. The starttime and stoptime fields are datetimes.

The AI Assistant responded with the following perfectly good Python code.

Response
import pandas as pd
df = pd.read_csv('trip.csv', parse_dates=['starttime', 'stoptime'])

Not wanting to try anything too complex yet, I followed up with a question about popular rental stations.

Prompt

List the 8 most popular stations from which to rent a bike.

The generated code used a field name, start station name, that is not part of the df dataframe.

Response
popular_stations = df['start station name'].value_counts().head(8)
print(popular_stations)

Perhaps there is another cycle share dataset out there that does use this column name. Apparently, the AI Assistant’s underlying LLM could not run code and use the results in follow-up questions. When I clicked the Fix Error button provided by the AI Assistant, I got:

Response

The error is caused by the key ‘start station name’ not being present in the DataFrame ‘df’. This could be due to a misspelling or the key not existing in the DataFrame. To fix the error, we can check the column names in the DataFrame and use the correct key. Here’s the corrected code:

popular_stations = df['Start Station'].value_counts().head(8)
print(popular_stations)

So, it was unable to find a potentially correct field name but did help find the root cause of the error. I manually fixed the code. From there I moved on to other relatively simple tasks such as creating a plot based on the popular_stations variable and plotting the number of rentals by date. Again, the LLM returned code that used non-existent field names and made other basic, easy-to-spot errors. Even though my confidence level was extremely low, I decided to see what the AI Assistant would do with a more difficult question. For reference, the plot of number of rentals by date shows that the overall average is in the ballpark of 415 rides/day. It varies over time and has a seasonal pattern, but the overall magnitude is on the order of several hundred bike rentals per day.

Prompt

Compute the average number of bike rentals by day of week.

The AI Assistant returned the following code to do the computation:

Response
# Convert starttime to datetime object
df['starttime'] = pd.to_datetime(df['starttime'])

# Group by day of week
grouped = df.groupby(df['starttime'].dt.dayofweek)

# Calculate number of bike rentals for each day of week
rentals = grouped.size()

# Calculate average number of bike rentals by day of week
average = rentals.mean()

# Print average
print(average)
40979.57142857143

The average was not computed for each day of the week; instead, the code counted rentals per weekday and then averaged those seven totals (286,857 / 7 ≈ 40,980). And even interpreted as a weekly rental volume, a number near 41,000 is not even close to the actual value of about 2,916 rentals per week.
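For reference, a correct calculation needs a per-weekday denominator. Here is a minimal sketch of one way to do it (my code, not the AI Assistant’s; it assumes every date in the range has at least one rental, which does hold at the system level):

# My sketch, not the AI Assistant's: average rentals per weekday, using a
# per-weekday count of distinct dates as the denominator.
# Assumes every date in the range has at least one rental.
rentals_per_dow = df.groupby(df['starttime'].dt.day_name()).size()
days_per_dow = (df['starttime'].dt.normalize()
                  .drop_duplicates()
                  .dt.day_name()
                  .value_counts())
avg_per_dow = rentals_per_dow / days_per_dow
print(avg_per_dow)  # several hundred rides per weekday, consistent with ~415 rides/day

The division aligns on the day name index, so each weekday’s total is divided by the number of times that weekday actually occurs in the data.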

After making it clear that I wanted the average for each of the days of the week, the AI Assistant generated the following:

Response
# create a new column for day of week
df['day_of_week'] = df['starttime'].dt.day_name()

# group the data by day of week and calculate the average number of rides
avg_rides_by_day = df.groupby('day_of_week')['tripduration'].count() / df['day_of_week'].nunique()

print(avg_rides_by_day)

day_of_week
Friday       6185.714286
Monday       6029.142857
Saturday     5496.000000
Sunday       4729.714286
Thursday     6278.000000
Tuesday      6126.571429
Wednesday    6134.428571
Name: tripduration, dtype: float64

Again, these results are off by an order of magnitude: df['day_of_week'].nunique() is 7, the number of distinct weekday names, not the roughly 98 times each weekday actually occurs in the date range, so each average is inflated by about a factor of 14. I shared the results of my experiment with one of my classes and told them they needed to double-check any code they received from an LLM and use analytical common sense to check the plausibility of the numeric results. Nevertheless, several students turned in eerily similar code that led to similarly blatantly incorrect results in response to a question I posed about the average number of airline flights by day of week for flights out of a specific airport during a given year. I wrote up this experience as a blog post which ended with these parting thoughts:

I certainly wasn’t surprised that the AI Assistant failed miserably on this last task as it’s not a simple query and the correct approach is unlikely to be very prominent in the training data. I was a bit surprised how often the AI Assistant would use non-existent column names or filenames, or try to do datatype conversions on columns that were already of the desired data type. It is certainly capable of creating boilerplate code for simple things which can then be manually patched up (e.g. fixing column names). I’m sure it will improve over time via some sort of reinforcement learning or non-LLM based tweaks to prevent things like nonexistent column name use. For now, I’m sticking with StackOverflow and writing my own code.

March 2025

Fast forward to 2025 and LLMs have made significant strides. The main proprietary general-purpose models include ChatGPT from OpenAI, Gemini from Google, and Claude from Anthropic. There are numerous versions of these models as the companies compete for market dominance and frequently release new variants of their base models. How would this new generation of LLMs fare with the bike share analysis problems?

I decided to use Anthropic’s Claude 3.5 Sonnet model.

Prompt

You have a csv file named trip.csv. Each row is a bike rental from a bike share system. The fields starttime and stoptime are datetime values specifying when the bike is rented and returned. The fields from_station_id and to_station_id are strings indicating the bike rental stations that the bike was rented from and returned to. Using Python with pandas, generate the code to compute the total number of rentals from each station.

Claude code response
import pandas as pd

# Read the CSV file
df = pd.read_csv('trip.csv')

# Count rentals from each station
station_rentals = df['from_station_id'].value_counts()

# Convert to a DataFrame for better formatting
rental_summary = pd.DataFrame({
    'station': station_rentals.index,
    'total_rentals': station_rentals.values
})

# Sort by number of rentals in descending order
rental_summary = rental_summary.sort_values('total_rentals', ascending=False)

# Reset index for cleaner display
rental_summary = rental_summary.reset_index(drop=True)

# Display the results
print("\nTotal Rentals by Station:")
print(rental_summary.to_string(index=False))

The code ran with no errors and produced expected results.

Prompt

Great. Now using Python and pandas, compute the average length of time in hours that bikes are rented.

Things get more interesting as the response goes beyond just what was asked. There was one error: a necessary library was not imported. Here is the code, along with my fix (noted in a comment).

Claude code response
import pandas as pd
## MY FIX: added necessary import
import matplotlib.pyplot as plt

# Read the CSV file
df = pd.read_csv('trip.csv')

# Convert datetime strings to datetime objects
df['starttime'] = pd.to_datetime(df['starttime'])
df['stoptime'] = pd.to_datetime(df['stoptime'])

# Calculate duration in hours for each rental
df['rental_duration'] = (df['stoptime'] - df['starttime']).dt.total_seconds() / 3600

# Calculate average duration
average_duration = df['rental_duration'].mean()

# Calculate additional statistics
duration_stats = {
    'average_hours': average_duration,
    'median_hours': df['rental_duration'].median(),
    'min_hours': df['rental_duration'].min(),
    'max_hours': df['rental_duration'].max(),
    'std_hours': df['rental_duration'].std()
}

# Print results
print("\nRental Duration Statistics (in hours):")
for stat, value in duration_stats.items():
    print(f"{stat.replace('_', ' ').title()}: {value:.2f}")

# Create a histogram of rental durations for rentals under 24 hours
short_rentals = df[df['rental_duration'] <= 24]
plt.figure(figsize=(10, 6))
plt.hist(short_rentals['rental_duration'], bins=48, edgecolor='black')
plt.title('Distribution of Rental Durations (<= 24 hours)')
plt.xlabel('Duration (hours)')
plt.ylabel('Number of Rentals')
plt.grid(True, alpha=0.3)
plt.show()

The output includes a statistical summary and a histogram of ride duration. I had only asked for the mean but Claude decided to give me more. Ok, now for the real question.

Prompt

Great. Now using Python and pandas, compute the average number of bike rentals by day of week.

The response by Claude is interesting. Its overall approach to solving the problem is far better than the early attempts described in the previous section. Yet, it makes a non-trivial syntactic error, using the dt accessor on an object that does not support it. It also overlooks the more subtle conceptual issue that lies at the heart of this entire exercise.

Like the previous example, Claude forgets to load a necessary library and then proceeds to read the data into a pandas dataframe named df. Then it begins to build up the pieces to compute the average number of rentals by day of week. First, for the numerator, we get code to count the number of rentals that occurred on each day of the week.

Claude code response
# Convert start_rental to datetime
df['starttime'] = pd.to_datetime(df['starttime'])

# Extract day of week (0 = Monday, 6 = Sunday)
df['day_of_week'] = df['starttime'].dt.dayofweek

# Create day name column for better readability
df['day_name'] = df['starttime'].dt.day_name()
# Calculate average rentals by day of week
daily_rentals = df.groupby('day_name')['starttime'].count().reset_index()
daily_rentals.columns = ['day_of_week', 'total_rentals']

Displaying the value of daily_rentals gives the following output.

Claude code output
day_of_week total_rentals
0 Friday 43300
1 Monday 42204
2 Saturday 38472
3 Sunday 33108
4 Thursday 43946
5 Tuesday 42886
6 Wednesday 42941

Next comes the code to compute how many of each weekday appear in the range of dates represented in the dataframe to serve as the denominator.

Claude code response
# Calculate the number of unique dates for each day of week
unique_days = df.groupby('day_name')['starttime'].dt.date.nunique().reset_index()

Running this code leads to the following error.

Claude code output
AttributeError: 'SeriesGroupBy' object has no attribute 'dt'

While the dt accessor is available on a Series object, it is not available on a SeriesGroupBy object. For a beginning Python and pandas user, this is not a particularly simple error to diagnose or fix. Beginning programmers often do not carefully read error messages, or if they do, have trouble understanding how to use the information in the error message to diagnose the problem.
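The distinction, in a minimal two-line sketch (day_name is the column created in Claude’s earlier code):

df['starttime'].dt.date                    # a Series: the dt accessor works
df.groupby('day_name')['starttime'].dt     # a SeriesGroupBy: raises AttributeError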

Here’s how we can do this while still using the same approach as Claude.

unique_dates_df = pd.DataFrame(df['starttime'].dt.date.unique(), columns = ['date'])
unique_dates_df['date'] = pd.to_datetime(unique_dates_df['date'])
unique_dates_df['day_of_week'] = unique_dates_df['date'].dt.dayofweek
unique_dates_df['day_name'] = unique_dates_df['date'].dt.day_name()

unique_days = unique_dates_df.groupby('day_name')['date'].size().reset_index()
unique_days.columns = ['day_of_week', 'number_of_days']

print(unique_days)
day_of_week number_of_days
0 Friday 98
1 Monday 99
2 Saturday 98
3 Sunday 98
4 Thursday 98
5 Tuesday 99
6 Wednesday 99
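As an aside, a more compact fix (my own sketch, reusing the day_name column created in Claude’s earlier code) sidesteps the SeriesGroupBy problem by materializing the date as its own column first:

# Alternative sketch: count distinct rental dates per weekday directly
df['date'] = df['starttime'].dt.date
unique_days = (df.groupby('day_name')['date']
                 .nunique()
                 .reset_index(name='number_of_days')
                 .rename(columns={'day_name': 'day_of_week'}))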

After this fix, the rest of the code works perfectly.

Claude code response
# Merge and calculate average
daily_averages = pd.merge(daily_rentals, unique_days, on='day_of_week')
daily_averages['average_rentals'] = daily_averages['total_rentals'] / daily_averages['number_of_days']

# Sort by day of week (Monday first)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
daily_averages['day_of_week'] = pd.Categorical(daily_averages['day_of_week'], categories=day_order, ordered=True)
daily_averages = daily_averages.sort_values('day_of_week')

# Round averages to 1 decimal place
daily_averages['average_rentals'] = daily_averages['average_rentals'].round(1)

# Display results
print("\nAverage Daily Rentals by Day of Week:")
print(daily_averages[['day_of_week', 'average_rentals']].to_string(index=False))

# Create a bar plot
plt.figure(figsize=(10, 6))
plt.bar(daily_averages['day_of_week'], daily_averages['average_rentals'])
plt.title('Average Number of Bike Rentals by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Average Number of Rentals')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Figure 1: Average rentals by DOW

This approach works fine for computing the mean number of rentals under the assumption that we have at least one rental on every date in the range spanned by the dataset. But, if there is a date having zero rides, that date will not get included in the denominator counts and all of the statistics will be inflated. For overall mean rental volume by day of week, it is extremely unlikely to have zero rental demand for an entire day. However, if we modify the question to ask about mean rental volume by day of week for a particular station, there might very well be days with zero rentals.
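A tiny numerical example makes the inflation concrete (the numbers are hypothetical, chosen only to show the mechanics): suppose a station logged 450 Monday rentals, but only on 90 of the 98 Mondays in the date range.

# Hypothetical numbers for illustration only
naive_mean = 450 / 90   # 5.00 rentals per Monday: 8 zero-rental Mondays silently dropped
true_mean = 450 / 98    # ~4.59 rentals per Monday: zero-rental Mondays included

The eight missing Mondays never appear in the grouped counts, so the denominator shrinks and every statistic computed from those counts inflates.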

Prompt

Great. Now using Python and pandas, compute the average number of bike rentals by day of week and hour of day.

Claude proceeds to generate code that suffers from the same shortcomings as the previous example. It makes the same error of using the dt accessor where it is not available. It also fails to correctly compute the denominator: the number of times each day of week and hour of day combination occurs over the analysis date range. It ends with this description of what its code does.

Claude code output

This code creates a comprehensive analysis of rental patterns by day and hour:

  • Creates a heatmap showing:

    • Days of week on the y-axis
    • Hours of day on the x-axis
    • Color intensity indicating average number of rentals
    • Actual values annotated in each cell
  • Provides summary statistics including:

    • Top 5 peak usage times (day and hour combinations)
    • Busiest hours averaged across all days
  • Uses proper averaging by:

    • Counting total rentals for each day-hour combination
    • Dividing by the number of each unique day in the dataset
    • Handling day sorting correctly (Monday through Sunday)

The heatmap visualization makes it easy to:

  • Identify rush hour patterns
  • Compare weekday vs weekend patterns
  • Spot peak usage times
  • See the daily cycle of rentals

Would you like me to modify the analysis? I could:

  • Add seasonal patterns to see how these patterns change throughout the year
  • Break down patterns by station
  • Add weather data correlation if available
  • Create separate visualizations for weekdays vs weekends
  • Calculate additional statistics about peak vs off-peak usage

The summary and suggestions for further analysis make sense. It all sounds very confident and correct.

After fixing the syntax error involving the dt accessor, the code runs and produces a beautiful heatmap with plausible-looking numbers in each colored cell. As long as there was at least one rental on each date, these results are correct. Next, we will ask for an upper percentile along with the mean number of rentals by day of week. The approach that Claude has used so far will not work, as we now need the number of rentals by date as an intermediate result rather than just a total count of rentals by day of week.

Prompt

Using Python and pandas, can you compute the mean and 95th percentile of the number of bikes rented by day of week?

After reading in the data and making sure that the starttime column was a datetime, Claude proceeded with the following code to correctly compute the number of rentals by date.

Claude code response
# Extract date and day of week
df['date'] = df['starttime'].dt.date
df['day_of_week'] = df['starttime'].dt.day_name()

# Get daily counts
daily_counts = df.groupby(['date', 'day_of_week']).size().reset_index(name='count')
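Claude’s aggregation step is not reproduced above, but it was presumably along these lines (my reconstruction, not Claude’s verbatim code):

# Reconstructed sketch: mean and 95th percentile of the daily counts by day of week
summary = (daily_counts
           .groupby('day_of_week')['count']
           .agg(mean='mean', p95=lambda x: x.quantile(0.95))
           .reset_index())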

Again, as long as there was at least one rental for each date in the analysis date range, the daily_counts dataframe can be used to compute the mean and percentiles of interest of the number of rentals by day of week. Claude provided that code and produced a very nice summary table. With the preceding as a warm-up, now it was time to put Claude to the test. What would it do when the possibility of zero volume days was almost certain?

Prompt

Using Python and pandas, can you compute the mean and 95th percentile of the number of bikes rented by day of week for station CH-06?

Like the example above, Claude first generated a dataframe of the number of rentals by date originating from the station of interest. In Figure 2, we can see that there are 653 rows in the dataframe of rentals by date. Recall that our raw data spans 689 days. There were 36 dates on which there were no rentals from this station. We need those “zero dates” in order to correctly compute the statistics of interest using the approach that Claude used in the previous query.

Figure 2: Rentals by date

Indeed, Claude generated the same code logic as in the previous example and produced the summary statistics shown in the left panel of Figure 3. They look plausible. They are wrong. Not by a lot, but they are wrong - see the right panel for the correct values.

Figure 3: Station CH-06 (left panel: summary statistics per Claude code; right panel: summary statistics per correct code)

While the discrepancies are modest for station CH-06, this is not the case for other stations. For example, Figure 4 shows the statistical summaries for station UD-02. The impact of zero volume days is quite large.

Figure 4: Station UD-02 (left panel: summary statistics per Claude code; right panel: summary statistics per correct code)

This raises a related question. Are the zero volume days scattered throughout the range of dates in the dataset or was the station offline for some period of time during this date range? If it was offline for a period, we might want to adjust the date range upon which to base our analysis. You can see the distribution of zero days in Figure 5 for station CH-06 and in Figure 6 for station UW-11.

Figure 5: Rentals by Date for CH-06

A look at the same plot for station UW-11 reveals a different story.

Figure 6: Rentals by Date for UW-11

Care must be taken to specify exactly how zero volume days should be treated when computing volume statistics. So, in the case of tools like Claude, is there a prompting problem? Absent instructions to do otherwise, Claude defaulted to using the entire dataset. Just as the onus would be on the analyst to take zero volume days into account when doing the analysis themselves, they are equally responsible for making sure that tools like Claude do the analysis properly. While the act of writing code and working through the analysis forces the analyst to think about all of the steps required, using an LLM might lead to cognitive offloading, with subtle data analysis issues overlooked in subsequent prompting and analysis. Oakley et al. (2025) argue that such cognitive offloading “undermines reasoning, impedes learning, and diminishes productivity”.

Does the act of iteratively building up analysis code lead to more thinking and better analysis? If one can bypass this thinking and offload it to an LLM, are we getting less nuanced analysis? Anyone who analyzes data for a living knows that data is messy and there are dragons hidden in it. Data analysts are often the best people to find these dragons and either slay them or alert others who can. LLMs may circumvent this important exploratory phase of the data analysis process.

How did I know to ask about zero volume days? I had graphed volume over time for the entire system and for individual stations. I knew Python, pandas, and SQL very well and knew that an aggregate query counting rentals by date was not going to have rows for dates with zero volume. And, I realized that this would lead to inflated statistics because I had done so many similar analyses over the years, and it is just common analytical sense that if the denominator is artificially small, the ratio is artificially large. Unfortunately, I didn’t have enough common sense to tell Claude about the missing zeros to see what it would do. I would rectify this oversight in July of 2025.

July 2025

After Claude 4 was released, I wanted to see if it was any better in dealing with this zero days problem. This time I structured the prompt a little differently to make it easy to test the results. I also decided to submit the same prompt to Gemini and ChatGPT.

Prompt

You have a pandas dataframe named trips. Each row is a bike rental from a bike share system. The fields starttime and stoptime are datetime values specifying when the bike is rented and returned. The fields from_station_id and to_station_id are strings indicating the bike rental stations that the bike was rented from and returned to. Using Python, write a function that will compute the mean and 95th percentile of the number of bike rentals by day of week for a specific station that is specified by the user. The user will pass in to the function the dataframe containing the rental data as well as a string indicating which station to compute the statistics for. The function should be called bike_summary_dow_station. The function should return a dataframe with three columns. The first column should be a string representing the day of the week. The second column should be called mean and should contain the mean number of bikes rented for that day of the week. The third column should be named p95 and contain the 95th percentile of the number of bikes rented by day of week.

Claude 4 responded with

Claude response

This function does the following:

  1. Filters the data to only include trips that start from the specified station using the from_station_id field
  2. Extracts date and day of week from the starttime column
  3. Groups by date and day of week to count the number of rentals per day
  4. Calculates statistics by grouping by day of week and computing the mean and 95th percentile of daily rental counts
  5. Orders the results so days of the week appear in the standard Monday-Sunday order
  6. Returns a DataFrame with the requested three columns: day_of_week, mean, and p95

The function handles the key aspects of your requirements:

  • Uses the starttime field to determine when bikes were rented
  • Filters by the from_station_id to focus on rentals from the specified station
  • Computes both mean and 95th percentile statistics
  • Returns results organized by day of week in a clean format

To use this function, you would call it like:

result = bike_summary_dow_station(trips, "station_123")
Claude code response
import pandas as pd
import numpy as np

def bike_summary_dow_station(trips, station_id):
    """
    Compute mean and 95th percentile of bike rentals by day of week for a specific station.
    
    Parameters:
    trips (pd.DataFrame): DataFrame with bike rental data containing columns:
                         - starttime: datetime of bike rental
                         - from_station_id: string ID of rental station
    station_id (str): Station ID to compute statistics for
    
    Returns:
    pd.DataFrame: DataFrame with columns:
                 - day_of_week: string representing day of week
                 - mean: mean number of bikes rented for that day
                 - p95: 95th percentile of bikes rented for that day
    """
    # Filter trips for the specified station
    station_trips = trips[trips['from_station_id'] == station_id].copy()
    
    # Extract date and day of week from starttime
    station_trips['date'] = station_trips['starttime'].dt.date
    station_trips['day_of_week'] = station_trips['starttime'].dt.day_name()
    
    # Count rentals per day
    daily_counts = station_trips.groupby(['date', 'day_of_week']).size().reset_index(name='rental_count')
    
    # Calculate mean and 95th percentile by day of week
    summary = daily_counts.groupby('day_of_week')['rental_count'].agg([
        ('mean', 'mean'),
        ('p95', lambda x: np.percentile(x, 95))
    ]).reset_index()
    
    # Ensure proper day of week ordering (Monday first)
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    summary['day_of_week'] = pd.Categorical(summary['day_of_week'], categories=day_order, ordered=True)
    summary = summary.sort_values('day_of_week').reset_index(drop=True)
    
    return summary
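
As a quick plausibility check (my own sketch, not something Claude suggested), one can compare the number of dates on which a station had at least one rental to the number of calendar days the dataset spans:

# Sketch: does this station have rentals on every date in the data?
# Station ID 'CH-06' reused from the earlier section.
station_dates = trips.loc[trips['from_station_id'] == 'CH-06', 'starttime'].dt.date.nunique()
total_days = (trips['starttime'].max().normalize()
              - trips['starttime'].min().normalize()).days + 1
print(station_dates, total_days)  # 653 vs 689 for CH-06, per the earlier analysis

Any gap between the two numbers means the naive groupby drops zero rental days and the resulting statistics will be inflated.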

Again, the zero days were ignored and this code will lead to inflated statistics whenever there are zero days for the specified station. This time I followed up with a question.

Prompt

Thank you. I think you forgot to consider the possibility of there being zero rentals on some days from the specified station. Please modify the code to account for this possibility.

Claude responded:

Claude response

You’re absolutely right! If there are days with zero rentals for a station, those days won’t appear in the grouped data, which would skew the statistics. Let me modify the function to account for this. Now the function properly handles days with zero rentals by:

  1. Creating a complete date range from the minimum to maximum date in the entire dataset
  2. Generating all date/day-of-week combinations so we have every single day represented
  3. Merging the actual rental counts with this complete date framework using a left join
  4. Filling missing rental counts with 0 for days when the station had no rentals

This ensures that days with zero rentals are properly included in the mean and percentile calculations, giving you more accurate statistics that reflect the true distribution of daily rental activity for each station.

Claude included this code:

Claude code response
import pandas as pd
import numpy as np

def bike_summary_dow_station(trips, station_id):
    """
    Compute mean and 95th percentile of bike rentals by day of week for a specific station.
    
    Parameters:
    trips (pd.DataFrame): DataFrame with bike rental data containing columns:
                         - starttime: datetime of bike rental
                         - from_station_id: string ID of rental station
    station_id (str): Station ID to compute statistics for
    
    Returns:
    pd.DataFrame: DataFrame with columns:
                 - day_of_week: string representing day of week
                 - mean: mean number of bikes rented for that day
                 - p95: 95th percentile of bikes rented for that day
    """
    # Filter trips for the specified station
    station_trips = trips[trips['from_station_id'] == station_id].copy()
    
    # Extract date and day of week from starttime
    station_trips['date'] = station_trips['starttime'].dt.date
    station_trips['day_of_week'] = station_trips['starttime'].dt.day_name()
    
    # Get the date range from the entire dataset
    all_dates = pd.date_range(start=trips['starttime'].min().date(), 
                             end=trips['starttime'].max().date(), 
                             freq='D')
    
    # Create a complete date/day_of_week combination
    complete_dates = pd.DataFrame({
        'date': all_dates.date,
        'day_of_week': all_dates.day_name()
    })
    
    # Count rentals per day for the station
    daily_counts = station_trips.groupby(['date', 'day_of_week']).size().reset_index(name='rental_count')
    
    # Merge with complete dates to include days with zero rentals
    complete_daily_counts = complete_dates.merge(daily_counts, on=['date', 'day_of_week'], how='left')
    complete_daily_counts['rental_count'] = complete_daily_counts['rental_count'].fillna(0)
    
    # Calculate mean and 95th percentile by day of week
    summary = complete_daily_counts.groupby('day_of_week')['rental_count'].agg([
        ('mean', 'mean'),
        ('p95', lambda x: np.percentile(x, 95))
    ]).reset_index()
    
    # Ensure proper day of week ordering (Monday first)
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    summary['day_of_week'] = pd.Categorical(summary['day_of_week'], categories=day_order, ordered=True)
    summary = summary.sort_values('day_of_week').reset_index(drop=True)
    
    return summary
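
Usage is unchanged; for example, for the low-volume station from Figure 4:

# Station ID reused from the earlier analysis
result = bike_summary_dow_station(trips, 'UD-02')
print(result)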

I followed up with:

Prompt

Why do you think you overlooked the possibility of zero rental days in your first response?

Claude said:

Claude response

Looking back, I think I overlooked it because I was focused on the mechanics of the aggregation and grouping operations without fully thinking through the data characteristics. When working with real-world data like bike rentals, it’s quite common to have days with zero activity - especially for individual stations that might be in less busy areas, during bad weather, or on certain days of the week.

My initial approach was more of a “happy path” solution where I assumed the grouped data would naturally represent all the relevant days. I was thinking about the pandas operations (groupby, agg) rather than considering the business logic of what the data actually represents.

It’s a good reminder that when working with count-based or frequency data, you often need to explicitly account for the “absence” of events (zero counts) since they don’t naturally appear in grouped datasets. This is particularly important for time-series analysis where missing periods can significantly skew statistical measures.

Thanks for catching that - it’s exactly the kind of edge case that can lead to misleading results in real analytics work.

From Claude’s response, it definitely “understands” the issue of zero rental days. What I find striking is how its response is structured as if a human analyst had written it. Claude is an LLM, yet it says things like “I was focused”, “without fully thinking through the data characteristics”, “I assumed”, and “I was thinking”. The user is nudged to think of Claude as a human-like entity using the same analytical processes a human might use. Yet, deep down, at the architecture level, lies an enormous neural network doing very sophisticated, probabilistic pattern matching. Wouldn’t the human be better served if Claude admitted that it was the additional mention of the possibility of zero rental days in the follow-up prompt that led to inputs that produced the correct outputs? Of course, Claude can’t “admit” anything. It’s not a reflective thinking being. The more that analysts remember this, the better the chance that they will use these tools cautiously and reflectively themselves.

Then I submitted the same prompts to ChatGPT and Google Gemini (2.5 Flash). Both made the same error as Claude, with almost identical code. Once prompted about the possibility of zero rental days, both were able to explain the implications in language very similar to Claude’s and to generate correct code. Their answers to my question about why they made the error were also very similar to Claude’s.

So, how should students use LLMs?

After one of these coding exchanges with Claude, I expressed my concern about how students are likely to use these tools and asked for Claude’s reaction and suggestions.

Prompt

Thank you. I have one final question for now. What would you tell a student who is just beginning to learn Python to do analytics work regarding the use of tools like Claude?

Claude response

That’s such an important question, and I think about this a lot given how AI tools are changing the learning landscape. Here’s what I’d tell a student just starting their Python analytics journey:

Embrace AI as a Learning Accelerator, Not a Replacement

Think of me like a really good tutor who’s available 24/7. I can help you understand concepts, debug code, and see examples - but the learning still needs to happen in your brain. The goal isn’t to have me write all your code, but to help you become a better programmer faster.

Use AI to Break Through Stuck Points

When you’re learning, getting stuck is incredibly frustrating and can kill motivation. Use AI tools to:

  • Understand error messages that seem cryptic
  • Get unstuck when you know what you want to do but can’t figure out the syntax
  • See multiple approaches to the same problem
  • Get explanations of code you’re trying to understand

This keeps you moving forward instead of giving up.

Always Try First, Then Ask

Before asking me to solve something, attempt it yourself. This might mean:

  • Writing pseudocode for what you think should happen
  • Trying some code even if you’re not sure it’s right
  • Looking up basic documentation first

Then when you ask me for help, you’ll understand my explanation much better because you’ve already engaged with the problem.

Ask for Explanations, Not Just Solutions

Instead of “Write code to analyze this data,” try:

  • “I’m trying to group this data by category and calculate averages. Here’s what I tried… can you explain where I went wrong?”
  • “Can you show me two different ways to solve this problem and explain the trade-offs?”
  • “I don’t understand why this pandas function works the way it does. Can you break it down?”

Build Your Foundation Systematically

AI tools are fantastic for specific problems, but they can’t replace systematic learning. Make sure you’re still:

  • Working through structured courses or books
  • Practicing basic concepts until they’re automatic
  • Understanding core data structures and control flow
  • Learning to read documentation

The Real Skill is Knowing What to Ask

As you progress, you’ll realize that the most valuable skill is knowing how to break down complex problems and ask the right questions. AI tools actually help you develop this because you quickly learn that vague questions get mediocre answers.

Don’t Skip the Struggle (But Don’t Suffer Needlessly)

Some struggle is essential for learning - it’s how your brain builds neural pathways. But there’s a difference between productive struggle (working through logical steps) and unproductive suffering (being stuck on syntax for hours). Use AI to minimize the latter while preserving the former.

The students I see succeeding with AI tools are those who use them to learn faster and explore more, not to avoid learning altogether. You’re entering an exciting time to learn programming - you have access to incredibly powerful learning tools that can accelerate your growth if you use them thoughtfully.

What specific aspects of learning Python analytics are you most excited or nervous about?

Prompt

I’m particularly worried about my students using Claude in suboptimal ways such as simply asking it to write the code for a homework problem. I worry that if this becomes a habit, they will struggle mightily when they are trying to solve complex proprietary problems in industry. How will they find subtle bugs? How will they explain their code to their coworkers or clients? How will they develop the skills to “ask the right questions” if they take the easy route in their university classes and just ask Claude for the answers?

Claude response

Your concerns are absolutely valid and reflect a critical challenge in education today. You’re right to worry - there’s a real risk that students who rely on AI as a “magic answer box” will hit a wall when they encounter complex, novel problems in industry.

The Real Dangers You’ve Identified

The Debugging Crisis: Students who haven’t learned to trace through code line-by-line, understand variable states, and systematically isolate problems will be helpless when Claude gives them code that’s 90% right but has a subtle logical error. Industry problems often involve edge cases and domain-specific constraints that generic AI solutions miss.

The Communication Gap: If students can’t explain their approach because they didn’t develop it themselves, they’ll struggle in code reviews, client presentations, and collaborative debugging sessions. “Claude wrote it” isn’t an acceptable explanation to a stakeholder.

The Problem Decomposition Deficit: The most valuable skill in analytics isn’t coding syntax - it’s breaking down messy, ambiguous business problems into tractable technical steps. Students who skip this mental muscle-building will struggle when faced with problems that don’t fit standard templates.

Pedagogical Strategies to Consider

Require Process Documentation: Ask students to submit not just code, but their problem-solving process. “What did you try first? What didn’t work? How did you modify your approach?” This makes copy-pasting much harder.

Focus on Code Reading and Debugging: Give students broken code to fix, or code to extend. These skills transfer directly to industry and are harder to outsource to AI.

Emphasize Data Understanding: Require students to explore and describe their datasets before any analysis. AI can’t replace domain knowledge and data intuition.

Two-Stage Assignments: First, require a written analysis plan without code. Then allow coding with AI assistance. This separates problem-solving from implementation.

The Bigger Picture Challenge

You’re grappling with a fundamental shift: in industry, the most valuable analysts aren’t necessarily the fastest coders, but those who can:

  • Ask the right business questions
  • Spot when results don’t make sense
  • Communicate insights to non-technical stakeholders
  • Design robust, maintainable analyses

Students who use AI as a shortcut miss developing these meta-skills.

A Possible Middle Path

Perhaps the goal isn’t to prevent AI use, but to structure it productively. For example:

  • “Use Claude to help debug, but first explain what you think the bug is”
  • “Get AI help with syntax, but write the analysis plan yourself”
  • “Use AI to generate multiple approaches, then justify which you chose and why”

Your instincts are spot-on. Students who develop deep problem-solving skills alongside AI fluency will be incredibly powerful. Those who become dependent on AI for thinking will struggle when they encounter the messy, ambiguous problems that define real analytics work.

There’s a lot of sensible advice in there. I’ll certainly share this with my students. Then it’s up to them.

Thoughts

Why does this one particular example bother me so much and compel me to periodically revisit it? It’s a very common business analysis problem and not a very difficult one, yet after two years of continued LLM development and untold billions spent, the top LLMs still get it wrong. As I write this on August 8, I just tried it again with GPT-5. Still wrong. These LLMs are ever confident, and I can easily see a novice analyst uncritically accepting their answer to this problem. But what really bothers me is the nagging suspicion that many students are going to use these tools to blast through homework assignments while doing little of the work needed to actually learn anything. The call of the homework machine will prove much too tempting for many. Its use will be easy to rationalize by appealing to the argument that LLM use will be required in industry.

It will be extremely difficult for educators to convince students that it is worth their time to forgo LLM use until they have developed significant competency and familiarity with the tools and concepts being learned. I will definitely be sharing Claude’s advice to students from the previous section. Will many follow the advice? Who knows. In the past, I’ve made similar pleas to my students. For example:

Message to my students

Tools like ChatGPT can be extremely powerful in the hands of a skilled analyst or developer. However, in the hands of a novice, they can be a recipe for disaster. Not only will you not learn R and Python, you may submit work in a professional environment that is of low quality, uses outdated methods or libraries, is buggy, is blatantly wrong, or even leads to security concerns. And of course, as a novice, you won’t even know it or know what to fix or how to fix it. That sounds fun.

Furthermore, while you may use ChatGPT to get through your courses with reasonable grades, employers don’t care about your grades from some mid-tier university B-school. Everyone knows that grades are inflated and you’ll be competing against applicants from better schools and more technical degree programs. You will be forced to show you actually know the material through difficult technical interviews. And of course, faculty will have no interest in serving as a professional reference for a student who ChatGPT’d their way through the course.

ChatGPT has its place, but code novices using it as a substitute for learning to code isn’t a great use.

and this:

Message to my students

Unintended consequences of Gen AI in tech hiring

The Pragmatic Engineer is probably the top software engineering newsletter on Substack. It is extremely good. Recently he released a newsletter discussing the impact of GenAI on tech hiring. You can read a good chunk of it for free from this link - https://newsletter.pragmaticengineer.com/p/how-genai-changes-tech-hiring.

A big takeaway is that use of GenAI to “cheat” during job interviews is rampant. It’s causing such a problem that many companies have to totally rethink how they do tech interviewing and hiring. People being people, one of the easiest things for hiring firms to do is put way less weight on the types of coding questions that can be gamed during interviews and fall back on the old “pedigree” criteria - top university for degrees and high quality professional experience.

So, if you come to a place like Oakland U for your degree and then proceed to ChatGPT your way through your classes, not only will you learn a lot less than you could have, you’ll also have a degree from a place that people are potentially biased against - that UM or MSU grad gets the nod because of where they went to school. Awesome, eh? So, now you have to work doubly hard to convince people you know your stuff. That’s much easier if you didn’t ChatGPT your way through your courses.

On a similar note, I’d like to share a little professional advice. I’ve gotten wayyyyyyy too many emails this year starting with the dreaded “I hope this email finds you well”. That opening reached meme status quite a while ago. ChatGPT was/is notorious for starting its generated emails with that greeting. People everywhere are cringing every time they get such an email. Even if your email wasn’t written with ChatGPT, the recipient has a reason to believe it was or that you haven’t figured out that that greeting has worn out its welcome. I encourage you to find a more authentic way to begin your emails. Again, unintended consequences of GenAI in the tech world.

Brian Klaas, in his essay The Death of the Student Essay—and the Future of Cognition, makes the case far better than I could ever hope to that uncritical use of LLMs will likely fundamentally affect humanity’s ability to think clearly. He pleads with his students:

Aware of my inability to detect AI use with certainty, I implored the students: please don’t use AI. It’s terrible for you. It’s terrible for me, a dystopian experience of spending weeks giving detailed constructive feedback to a machine. I strongly suspect—but can’t prove—that many didn’t listen.

In another letter to his students, the author likens using LLMs for coursework and expecting to learn to taking a forklift into the weight room and expecting to get stronger. Research is starting to suggest that cognitive offloading to LLMs is eroding students’ critical thinking abilities (Gerlich 2025; Kosmyna et al. 2025; Oakley et al. 2025). No surprise there. Ask young adults to navigate a city or suburb without Google Maps and prepare yourself for a wide-eyed plea to “be serious”.

More concerns

As companies begin to reduce entry level analyst hiring thinking that AI is a substitute, who exactly will be the one to catch subtle errors such as these? How will senior analysts be developed from a shrinking pipeline of entry level people? When decreased entry level hiring leads to drops in enrollment in the degree programs that fed these job pipelines, what will longer term talent prospects look like? Senior analysts and devs will move into management and eventually retire. Who will take their place?

Much like our modern society riding the cresting wave of cheap energy from a one-time exploitation of fossil fuels, companies are cutting labor costs by exploiting a one-time jolt: the creative work of countless people accumulated over many years. The AI giants are already desperate for new “high quality” data to train their models. The once rich reservoirs of the internet are now being filled with streams of AI generated content of questionable quality and utility. The enshittification of the internet is real and AI slop is everywhere.

Relying on an LLM can make your business processes less resilient. What happens when the model goes down, or degrades, or goes out of business, or a new version behaves significantly differently? Yesterday, GPT-5 was released and Reddit is full of people complaining about it, especially its “router” that automatically determines which GPT submodel to use. Redditors mention carefully crafted workflows that are now broken due to the change in the underlying LLM powering parts of them. As agents or automated pipelines call LLMs and send the output to dashboards, will subtle errors such as the zero days problem sneak through? Using LLMs optimizes for efficiency at the expense of resilience. We need more resilience and less optimization. Optimization to the point of fragility is asking for trouble. Agentic AI will just magnify this lack of resilience.

I enjoy reading StackOverflow posts as well as RTFMing. The back and forth in a good SO post was more valuable to me than just getting “the answer” from some LLM. Sure, sometimes people were a bit snarky, but it’s a pretty information dense place full of actual humans who donate their time to help others solve problems. Learning about why one method is better than another or just seeing how other humans are struggling with the same problem is something I value. It made me feel part of something larger. Perusing software documentation was a bit like browsing the stacks at a library, opening my eyes to related topics and approaches that might prove useful. It helped me grasp the big picture of how a certain software tool fit into the problem solving landscape. And, if I revisited an SO post or a documentation page, it might have some minor edits or changes, but it was essentially the same information. It’s not stochastic.

Sadly, SO traffic has decreased considerably and its future is questionable. As fewer humans contribute to SO, LLMs lose one of their key training assets and we humans lose the power and potential of a huge networked knowledge community. I don’t want one opinion on how to do something and I certainly don’t want to blindly accept an LLM’s prescribed solution. There’s rarely one right way, and the process of choosing among several approaches involves critical thinking that I feel is integral to becoming a better analyst and software developer.

Ultimately, LLMs profoundly change the problem solving process by obliterating large parts of it. It is as if society has decided that it’s perfectly fine to ask one so-called expert, who is known to make mistakes and will give you different answers every time you ask it a question, for the answers to problems we used to grapple with ourselves. Sure, we used search engines, computers, YouTube and talked to other people, but we didn’t ask some computerized oracle for the answer. There’s no process or learning journey when one simply gets the answer from an LLM. It’s efficient, for sure. Unless, of course, it’s wrong and fixing the mess turns out to be a whole lot of work.

I still haven’t learned to play the ukulele that was gifted to me. The simple reason is that I don’t practice enough. When I do practice and make some progress, I feel a sense of accomplishment and of hope. If I just do the work, I can be better at something I care about and I’ll be honoring the gift. If I don’t do the work, I won’t get any better. If I’m content to just listen to others play the uke, then that will be ok. But, will it, really?

References

Freund, Daniel, Shane G. Henderson, and David B. Shmoys. 2022. “Minimizing Multimodular Functions and Allocating Capacity in Bike-Sharing Systems.” Operations Research 70 (5): 2715–31. https://doi.org/10.1287/opre.2022.2320.
Gerlich, Michael. 2025. “AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network. https://doi.org/10.2139/ssrn.5082524.
Haider, Zulqarnain, Alexander Nikolaev, Jee Eun Kang, and Changhyun Kwon. 2018. “Inventory Rebalancing Through Pricing in Public Bike Sharing Systems.” European Journal of Operational Research 270 (1): 103–17.
Jian, Nanjing, Daniel Freund, Holly M. Wiberg, and Shane G. Henderson. 2016. “Simulation Optimization for a Large-Scale Bike-Sharing System.” In 2016 Winter Simulation Conference (WSC), 602–13. IEEE.
Kosmyna, Nataliya, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, and Pattie Maes. 2025. “Your Brain on ChatGPT: Accumulation of Cognitive Debt When Using an AI Assistant for Essay Writing Task.” arXiv. https://doi.org/10.48550/arXiv.2506.08872.
Oakley, Barbara, Michael Johnston, Kenzen Chen, Eulho Jung, and Terrence Sejnowski. 2025. “The Memory Paradox: Why Our Brains Need Knowledge in an Age of AI.” Available at SSRN 5250447.
Schuijbroek, Jasper, Robert C Hampshire, and W-J Van Hoeve. 2017. “Inventory Rebalancing and Vehicle Routing in Bike Sharing Systems.” European Journal of Operational Research 257 (3): 992–1004.

Reuse

Citation

BibTeX citation:
@online{isken2025,
  author = {Isken, Mark},
  title = {A Cautionary Tale for Using {LLMs} for Business Analytics},
  date = {2025-08-05},
  url = {https://bitsofanalytics.org//posts/caution_llm/caution_llm.html},
  langid = {en}
}
For attribution, please cite this work as:
Isken, Mark. 2025. “A Cautionary Tale for Using LLMs for Business Analytics.” August 5, 2025. https://bitsofanalytics.org//posts/caution_llm/caution_llm.html.