Today I got an email from DataCamp inviting me to try out their new AI Assistant (powered by OpenAI) within their cloud-based notebook environment called Workspace. I teach analytics courses in a business school. For one of those courses, students get access to DataCamp for the semester, but its use in the course is entirely optional. DataCamp has some very nice tutorials for those getting started in analytics and data science, and it gives students another way to start learning to do analytics work in R and Python.
The AI Assistant invite was unexpected and I really hadn’t even thought about DataCamp rolling out something like this - but it makes perfect sense. So, I figured I might as well give it a try. Now, I’ve been (trying to) keep up with the LLM/GPT/AI frenzy that’s been happening since last fall, but other than a few basic interactions with ChatGPT 3.5, I really haven’t used these tools much at all. As an educator, I have been closely following discussions in the higher ed community about the impact of these tools. I say all that to emphasize that I’m very much a GPT newb, and what follows is by no means a rigorous or exhaustive look at this AI Assistant. I’m sure there are better ways to interact with this thing. This is just me messing around with some typical kinds of things I teach and use in my classes.
Basic (and not so basic) analysis of cycle share data
Within the Workspace cloud notebook environment, there are a bunch of preloaded datasets, including some cycle share data that has already been summarized in terms of number of rides per date and includes some weather-related variables. I did some basic pandas groupby queries and simple plots, and the AI Assistant worked quite well. In my classes I’ve used cycle share data, but we typically work with the raw trip data in which each row is a bike rental instance. I noticed that you can upload your own dataset into a Workspace. Ok, let’s see what this thing can do with a dataset that is not pre-loaded, but is very prominent out on the interwebs.
I uploaded the trip.csv datafile from the Pronto Cycleshare Dataset. It’s a few hundred thousand rows and just under 50 MB in size. No problem with the upload.
Since I knew we were going to be doing some datetime work, I helped the AI with the import by asking:
Import the trip.csv file into a pandas dataframe. The starttime and stoptime fields are datetimes.
It generated the following perfectly correct code.
```python
import pandas as pd

df = pd.read_csv('trip.csv', parse_dates=['starttime', 'stoptime'])
```
Here’s what the dataframe looks like in terms of column names.
Let’s find the 8 most popular stations to rent from.
List the 8 most popular stations from which to rent a bike.
The generated code contains a column name that doesn’t exist in the dataframe. No doubt, there’s another cycle share dataset out there that does use this column name.
```python
popular_stations = df['start station name'].value_counts().head(8)
print(popular_stations)
```
When I clicked Fix Error, I got:
The error is caused by the key ‘start station name’ not being present in the DataFrame ‘df’. This could be due to a misspelling or the key not existing in the DataFrame. To fix the error, we can check the column names in the DataFrame and use the correct key. Here’s the corrected code:
Let’s turn this into a bar chart. Can I somehow reference the previous output? Does it know about the correct field name? I tried this:
Create a bar plot from popular_stations.
Again, it used the wrong column name to (re)compute popular_stations (even though it didn’t need to recompute it). I fixed up the code manually.
```python
import matplotlib.pyplot as plt

popular_stations = df['from_station_id'].value_counts().head(8)

plt.bar(popular_stations.index, popular_stations.values)
plt.title('Most Popular Stations to Rent From')
plt.xlabel('Station ID')
plt.ylabel('Number of Rentals')
plt.show()
```
In another attempt at doing the same thing, the AI Assistant used the correct column name from the previous code chunk. Strangely, it then recomputed popular_stations instead of just using it.
Now let’s make a line chart of the number of rides by date.
make a line chart of the number of rides by date.
It used a non-existent column for the start date even though we told it that starttime and stoptime were datetimes. Not surprisingly, the Fix Error identified the bad column name but didn’t go further than that in terms of suggesting possible columns to use (i.e., the ones whose datatype is datetime). So, here’s the manually fixed up code:
```python
import matplotlib.pyplot as plt

# group the data by date and count the number of rides
rides_by_date = df.groupby(df['starttime'].dt.date).size()

# plot the line chart
plt.plot(rides_by_date.index, rides_by_date.values)
plt.title('Number of Rides by Date')
plt.xlabel('Date')
plt.ylabel('Number of Rides')
plt.show()
```
I asked it to make the x-axis more readable and it produced the following bit of correct code with rotated axis labels.
```python
import matplotlib.pyplot as plt

# group the data by date and count the number of rides
rides_by_date = df.groupby(df['starttime'].dt.date).size()

# plot the line chart
plt.plot(rides_by_date.index, rides_by_date.values)
plt.title('Number of Rides by Date')
plt.xlabel('Date')
plt.ylabel('Number of Rides')

# rotate the x-axis labels for readability
plt.xticks(rotation=45)
plt.show()
```
Strangely, when I tried to reproduce the above behavior, the generated code went back to using non-existent column names, and instead of rotating the labels it used DateFormatter objects. The generated code would also do things like converting columns to datetimes even when they were already datetimes.
And now for something more difficult
Ok, let’s try something tougher like computing the average number of riders by day of week. This is not particularly easy since you have to worry about things like the specific date range used (could have a different number of, say, Mondays than the other days of the week) and the possibility of there being zero rides within a given group. Now, the zero thing is a non-issue for overall average number of rides by day of week, but, if we do something like computing the average number of rides by starting station by day of week and hour of day, there most certainly will be instances of zero rides in some of those groups. If they aren’t properly accounted for, the averages will be inflated. I did a blog post on this issue several years ago (so, there’s a chance it’s in the training data for the LLM).
Compute the average number of bike rentals by day of week.
Here’s what I got for one of the attempts:
Another time I tried the same query, I got the following code. Well, actually it wanted to read some completely different csv file, but I manually fixed that. Nevertheless, the computation is wrong (really wrong). If you look back at the line plot, you’ll see that there are around 400-500 rides per day.
```python
# import pandas as pd

# load the data
# df = pd.read_csv('trip.csv')

# convert starttime column to datetime datatype
df['starttime'] = pd.to_datetime(df['starttime'])

# create a new column for day of week
df['day_of_week'] = df['starttime'].dt.day_name()

# group the data by day of week and calculate the average number of rides
avg_rides_by_day = df.groupby('day_of_week')['tripduration'].count() / df['day_of_week'].nunique()
```
Yeah, that’s not even close to being right. The AI Assistant also likes to repeat import statements and reread files. This makes it difficult to create a coherent workflow, as it’s easy to inadvertently clobber previous data prep code, or you have to remember to comment out unneeded file rereads.
I also tried asking for code to compute the average number of rides by station by day of week by hour of day. Again, I got bad field names and incorrect computations. There are definitely groups with zero rides, and this code wasn’t taking that into account.
That’s enough for now.
Parting thoughts
I certainly wasn’t surprised that the AI Assistant failed miserably on this last task, as it’s not a simple query and the correct approach is unlikely to be very prominent in the training data. I was a bit surprised how often the AI Assistant would use non-existent column names or filenames, or try to do datatype conversions on columns that were already of the desired data type. It is certainly capable of creating boilerplate code for simple things, which can then be manually patched up (e.g., fixing column names). I’m sure it will improve over time via some sort of reinforcement learning or non-LLM based tweaks to prevent things like nonexistent column name use. For now, I’m sticking with StackOverflow and writing my own code.