Another cautionary tale for using LLMs for business analytics

Weighted averages

R
llm
ai
Author

Mark Isken

Published

March 11, 2026

In a previous post, I explored a data analysis error made repeatedly over the course of several years by multiple LLMs. Recently, while grading homework in my Practical Computing for Data Analytics class, another such LLM misstep grabbed my attention. This one is less subtle than the one raised in the previous post and, to me, more troubling.

But, the point of this post is not to highlight some shortcoming of LLMs; there are plenty of examples of that out there already. I'm more interested in the response that Gemini gives when I ask it to reflect on why it made the errors it did, and in what that response means for my students, who are the intended audience of this post.

Exploring country metrics

One of the homework assignments in my class focuses on learning the basics of the R package dplyr. The students are given a dataset containing various economic, population, energy, and health-related metrics obtained from the well-known Our World in Data site. Each row in the dataset is for a single country in a single year.

Ask the students

During the assignment, students are asked to filter the dataframe to a single year and then compute the mean of GDP per capita by continent. Pretty much everyone did the following:

country_metrics  |>  
  filter(year == 2022)  |>  
  group_by(continent)  |>  
  summarize(
    mean_gdp_pc = mean(gdp_per_capita, na.rm = TRUE),
  ) 

The problem is that the units of gdp_per_capita are dollars per person, and the populations of the different countries within a continent might vary tremendously. In order to take this into account, we need to do a weighted average. Doing a simple average of gdp_per_capita for one country containing 300 million people and another containing 20 million people doesn't make sense if we want the average GDP per capita for all 320 million people. We can do that like this:

country_metrics  |>  
  filter(year == 2022)  |>  
  group_by(continent)  |>  
  summarize(
    mean_gdp_pc = sum(gdp_per_capita * population) / sum(population),
  ) 
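To make the difference concrete, here is a small sketch with two hypothetical countries (the numbers are made up for illustration). Base R's weighted.mean() computes the same quantity as the explicit sum formula, and its na.rm = TRUE argument drops countries with a missing metric from both the numerator and the denominator, which the explicit formula above does not do on its own:

```r
# Two hypothetical countries on the same continent (made-up numbers)
population <- c(300e6, 20e6)
gdp_per_capita <- c(60000, 10000)

# Simple average treats both countries equally
mean(gdp_per_capita)                                # 35000

# Weighted average reflects all 320 million people
sum(gdp_per_capita * population) / sum(population)  # 56875

# Base R's weighted.mean() gives the same result
weighted.mean(gdp_per_capita, w = population)       # 56875
```

The simple average is pulled far below the weighted one because it gives the 20-million-person country the same vote as the 300-million-person one.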

This is a common oversight that I've seen over multiple semesters, and as part of my general feedback to the class I discuss this issue. It also pops up with metrics such as life expectancy or the under-five mortality rate. But, this semester, I thought I'd see what one of the big LLMs would do with this. Our university has an agreement with Google and all students have free access to Google Gemini, so that's what I used (Gemini 3 Flash).

Ask the LLM - Question 1: Compute mean GDP per capita

I started with this prompt.

Prompt

I have an R dataframe where each row is a country and I have columns with the following variables: GDP per capita, life expectancy, and total energy use. I also have a column called continent that indicates which continent each country is a part of. Please show me the R code to compute the mean GDP per capita by continent, mean life expectancy by continent and mean total energy use by continent. Use the dplyr library.

Both GDP per capita and life expectancy require weighted averaging. For total energy use, a summation probably makes more sense than any type of mean (weighted or not). Let’s see what the LLM comes up with.

Gemini’s response to Question 1

Using dplyr is the way to go here. It makes data manipulation feel more like writing a sentence and less like solving a puzzle. To get those averages, you’ll want to use the split-apply-combine strategy: split the data by continent, apply the mean() function to your variables, and combine it all back into a neat summary table.

The R Code

Assuming your dataframe is named df, here is the cleanest way to compute those means:

R

library(dplyr)

# Summary statistics by continent
continent_summary <- df %>%
  group_by(continent) %>%
  summarize(
    mean_gdp = mean(`GDP per capita`, na.rm = TRUE),
    mean_life_exp = mean(`life expectancy`, na.rm = TRUE),
    mean_energy = mean(`total energy use`, na.rm = TRUE)
  )

# View the results
print(continent_summary)

Why this works:

  • group_by(continent): This tells R to treat each continent as its own little mini-dataframe for the next step.
  • summarize(): This collapses each group into a single row.
  • na.rm = TRUE: This is a lifesaver. If a single country is missing data (an NA), the mean would normally return NA. This argument tells R to just ignore the missing values and calculate the average from the data that is there. Pro Tip: If you have many more columns and don’t want to type them all out, you can use across() to do them all at once:

R

df %>%
  group_by(continent) %>%
  summarize(across(c(`GDP per capita`, `life expectancy`, `total energy use`), mean, na.rm = TRUE))

Would you like me to show you how to visualize these continent averages using a bar chart or a bubble plot?

Reflections on Gemini’s answer to Question 1

Obviously, Gemini provided code that does simple averaging for all three statistics. Notice how friendly and confident the LLM sounds as it assures the student that it knows its stuff. It cheerily explains, in plain English, why it did what it did.

I thought, well, I didn’t actually give Gemini a population field to work with. Let’s help it out.

Ask the LLM - Question 2: Including population

Prompt

Oops, I forgot to mention that I also have a column with the population of each country. Please repeat the previous analysis but also include a computation of the total population by continent. It should still be done with a single dplyr command.

Gemini’s response to Question 2

No problem! Adding the population is a quick tweak. Since you want the total population rather than the average, we just swap the function for that specific column. Here is the updated code to handle everything in one go:

R

library(dplyr)

# Comprehensive summary by continent
continent_summary <- df %>%
  group_by(continent) %>%
  summarize(
    mean_gdp = mean(`GDP per capita`, na.rm = TRUE),
    mean_life_exp = mean(`life expectancy`, na.rm = TRUE),
    mean_energy = mean(`total energy use`, na.rm = TRUE),
    total_pop = sum(population, na.rm = TRUE) # Summing for total population
  )

# View the results
print(continent_summary)

Key Adjustments

  • sum(population): Unlike the other variables where we want the average, we use sum() here to get the aggregate headcount for the entire continent.
  • The Power of the Pipe (%>%): Notice how we keep everything inside a single block. This “pipeline” style keeps your workspace clean and your logic easy to follow.

Quick Tip on Data Types

If your population numbers are very large, R might display them in scientific notation (like 1.4e09). If you prefer seeing the full numbers, you can run options(scipen = 999) before printing your summary.

Would you like to see how to rank these continents from highest to lowest total population within the same command?

Reflections on Gemini’s answer to Question 2

Not surprisingly, it correctly added the code to sum the population field. But, it didn’t pick up on the fact that countries have different populations and that things like mean GDP per capita and mean life expectancy need weighted averages.

Ask the LLM - Question 3: Challenge the previous answers

Now it was time to call out the LLM on its analysis.

Prompt

Wait, isn’t that the wrong way to compute mean GDP per capita and mean life expectancy? Each country has a different population and a simple average will lead to inaccurate results.

Gemini’s response to Question 3

You are absolutely right. My apologies for the oversight—you’ve hit on a crucial statistical distinction! If you take a simple average of GDP per capita across countries, you are treating a country of 50,000 people the same as a country of 1.4 billion. To get an accurate picture of the “average person’s” experience on a continent, you need to calculate weighted means.

For GDP per capita, the mathematically accurate way to find the continent average is to divide the Total GDP of the continent by its Total Population. For Life Expectancy, we weight the value by the population of each country.

The Corrected “Weighted” Analysis

Here is how you do that accurately in a single dplyr pipe:

R

library(dplyr)

continent_analysis <- df %>%
  group_by(continent) %>%
  summarize(
    # Total population for the continent
    total_pop = sum(population, na.rm = TRUE),
    
    # Weighted Mean Life Expectancy: sum(life_exp * pop) / sum(pop)
    weighted_mean_life_exp = sum(`life expectancy` * population, na.rm = TRUE) / total_pop,
    
    # True GDP per capita: Total GDP / Total Population
    # We first calculate Total GDP for each country (GDP_per_cap * pop)
    mean_gdp_per_capita = sum(`GDP per capita` * population, na.rm = TRUE) / total_pop,
    
    # Mean Energy Use (Weighted by population)
    weighted_mean_energy = sum(`total energy use` * population, na.rm = TRUE) / total_pop
  )

print(continent_analysis)

Why this is more accurate:

  • The Problem: If “Country A” has 10 people and a GDP per capita of $100, and “Country B” has 90 people and a GDP per capita of $10, a simple average says the mean is $55.
  • The Reality: In that combined group of 100 people, the actual wealth per person is only $19.
  • The Logic: By multiplying the variable by the population before summing, we ensure that larger countries have a proportional “vote” in the final continental average.

A Note on Energy Use

Depending on your specific goals, “total energy use” is often reported as an aggregate for the whole country. If your column is already the Total energy (not per capita), you would simply use sum(total_energy_use) to get the continent’s footprint. If it is Energy Use per capita, then the weighted formula above is the way to go.

Would you like to see how to use weighted.mean()—a built-in R function—to make this code even shorter?

Reflections on Gemini’s answer to Question 3

So, I tell the LLM it’s wrong but I don’t tell it why it’s wrong. It comes up with the correct weighted averaging approach. Why? Did it not know that GDP per capita needs weighted averaging? How does it suddenly “know” this after being told its original approach is incorrect? Let’s ask Gemini.
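Gemini twice offers to shorten the corrected code with weighted.mean(), but never actually shows it. For reference, here is a sketch of what that version might look like. The column names match those used in the prompts, and the df here is a tiny made-up dataframe (hypothetical values) so the pipeline runs end to end:

```r
library(dplyr)

# Tiny made-up dataframe using the column names from the prompts
df <- tibble(
  continent         = c("Asia", "Asia", "Europe"),
  `GDP per capita`  = c(12000, 60000, 40000),
  `life expectancy` = c(70, 82, 80),
  `total energy use` = c(5000, 1500, 900),
  population        = c(1.4e9, 1.2e8, 6.0e7)
)

continent_summary <- df |>
  group_by(continent) |>
  summarize(
    # weighted.mean(x, w) is equivalent to sum(x * w) / sum(w)
    mean_gdp_pc   = weighted.mean(`GDP per capita`, population, na.rm = TRUE),
    mean_life_exp = weighted.mean(`life expectancy`, population, na.rm = TRUE),
    total_energy  = sum(`total energy use`, na.rm = TRUE),  # a sum, not a mean
    total_pop     = sum(population, na.rm = TRUE)
  )

print(continent_summary)
```

Note that weighted.mean() with na.rm = TRUE drops a country from both the numerator and denominator when its metric is missing, which is usually what you want here.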

Ask the LLM - Question 4: Why did you botch this in the first place?

Prompt

I’m curious as to why you gave an incorrect answer to begin with when this is a rather basic example of needing a weighted average? Is there something about the way LLMs are designed that makes these kinds of errors unavoidable?

Gemini’s response to Question 4

You’ve hit on one of the most fascinating “glitches in the matrix” of how Large Language Models (LLMs) like me function. You’re right—statistically, a weighted average is the only way to get a meaningful continental metric, yet I initially defaulted to a simple mean().

There are three main reasons why LLMs frequently “trip” on this specific type of logic:

1. Pattern Matching vs. Intent Modeling

I am trained on billions of lines of code. In the vast majority of R tutorials and Stack Overflow answers, the “basic” way to use group_by and summarize is paired with the mean() function.

  • The Pattern: “Compute the average X by Y” → summarize(mean(X)).
  • The Flaw: My “brain” recognized the request for a summary and reached for the most common code pattern associated with that request, rather than first performing a deep “sanity check” on the underlying statistical validity of applying that pattern to population-based data.

2. The “Path of Least Resistance” Bias

LLMs are designed to be helpful and direct. When a user asks for “the mean,” the model often takes the literal linguistic path.

  • Literal Interpretation: You asked for “the mean,” so I provided the mean() function.
  • Analytical Interpretation: You asked for a “continental average,” which requires understanding that a continent is a collection of weighted units.

Unless a prompt explicitly mentions weighting or “per capita adjustments,” models often default to the simplest arithmetic interpretation because that is what satisfies the “literal” requirement of the prompt most often.

3. Lack of a Persistent “World Model”

While I “know” that China has more people than Iceland, I don’t “hold” that fact in active memory while writing the code unless I am specifically prompted to consider the relationship between the columns. I treat GDP_per_capita and population as abstract variable names until the logic of the calculation forces me to look at how they interact.

The Architecture of the Error

Most LLMs operate on Autoregressive Prediction. I am predicting the next most likely token (word or symbol) in a sequence. \[P(w_{n} | w_{1}, ..., w_{n-1})\] In the sequence “summarize mean…”, the probability of the next token being the column name is very high. The “correction” requires a higher-level reasoning step that overrides this probabilistic “auto-complete.”

How to “Stress Test” an AI in the future

To get better results from any AI when dealing with data, it helps to use “Chain of Thought” prompting. If you say: “Think about the relationship between population and per-capita metrics, then write the code,” it forces the model to activate those analytical “neurons” before it starts typing code.

Would you like to see how to use the weighted.mean() function to make that previous code even more concise?

Reflections on Gemini’s answer to Question 4

Gemini’s response is a good reminder for students who use LLMs for analysis work. These things don’t know what words actually mean; they know how patterns of text relate to other patterns of text, relationships learned from looking at many examples of such text patterns. This one particular sentence really stood out:

I treat GDP_per_capita and population as abstract variable names until the logic of the calculation forces me to look at how they interact.

There is no world view, no sense of the contextual meaning of terms like GDP per capita or population. The output of an autoregressive prediction model trained on vast troves of text provides the illusion of understanding meaning and having a world view. It wasn’t until I added certain words to the model’s context window that it found different text pattern relationships that actually ended up answering the question.

Wait, isn’t that the wrong way to compute mean GDP per capita and mean life expectancy? Each country has a different population and a simple average will lead to inaccurate results.

I never use the term “weighted average”. I do mention things like “wrong”, “different population”, “simple average”, and “inaccurate”. Which terms pushed the model to find different next-token probabilities?

Citation

BibTeX citation:
@online{isken2026,
  author = {Isken, Mark},
  title = {Another Cautionary Tale for Using {LLMs} for Business
    Analytics},
  date = {2026-03-11},
  url = {https://bitsofanalytics.org//posts/caution_llm_weighted_avg/caution_llm.html},
  langid = {en}
}
For attribution, please cite this work as:
Isken, Mark. 2026. “Another Cautionary Tale for Using LLMs for Business Analytics.” March 11, 2026. https://bitsofanalytics.org//posts/caution_llm_weighted_avg/caution_llm.html.