Analyzing NestWatch data with R

How many baby birds did we help?

R
ecology
birding
Author

Mark Isken

Published

March 1, 2025

About NestWatch

One of the programs run by the Cornell Lab of Ornithology is known as NestWatch.

NestWatch is a nationwide nest-monitoring program designed to track status and trends in the reproductive biology of birds.

Our local birding group has participated in this program for a number of years. One member has built numerous bluebird boxes and our group has installed and monitored these nest boxes.

If you’re reading this, it’s pretty likely you are familiar with the NestWatch program and I’m not going to expain it in detail here. Some key resources for learning more, include:

Downloading NestWatch data

Login to the https://nestwatch.org/ site containing your NestWatch data. For this post, I’ll pretend our username is Sialia.

At the bottom of the main page, after logging in, you’ll find three sections for downloading data:

Download data main

Download Nest Site Descriptions

This is a table containing information about each individual nest box. You should re-download this table whenever new bird boxes are added. Since I’m using R for the data analysis, I download the CSV version of this table.

The downloaded file will be named siteDescriptions-[username]-[yyyymmdd].csv. The [yyyymmdd] part is the date you download the file. For all of these examples, pretend you are downloading the data on 2024-02-21.

Example: siteDescriptions-Sialia-20240221.csv

We will take a brief look at this file now just to get a sense of what’s in there.

library(readr)
library(janitor, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(lubridate, warn.conflicts = FALSE)
library(tidyr)
library(stringr)
library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(gt)
raw_data_path <- "./data/raw"
site_stub <- "siteDescriptions-Sialia-20240221"

site_file <- file.path(raw_data_path, paste0(site_stub, ".csv"))
site_df <- read_csv(file = site_file, col_names = TRUE, 
                    comment = '#', show_col_types = FALSE)
str(site_df)
spc_tbl_ [58 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Site Name                : chr [1:58] "Kestrel Box" "BCNP07" "BCNP08" "BCNP09" ...
 $ Latitude                 : num [1:58] 42.8 42.7 42.7 42.7 42.7 ...
 $ Longitude                : num [1:58] -83.1 -83.2 -83.2 -83.2 -83.2 ...
 $ Substrate                : chr [1:58] "nest box / birdhouse" "nest box / birdhouse" "nest box / birdhouse" "nest box / birdhouse" ...
 $ Height Above Ground      : num [1:58] NA 5 5 5 5 5 5 5 12 12 ...
 $ Height Above Ground Units: logi [1:58] NA FALSE FALSE FALSE FALSE FALSE ...
 $ Entrance Diameter        : num [1:58] NA NA NA NA NA NA NA NA 3 3 ...
 $ Entrance Diameter Units  : chr [1:58] NA NA NA NA ...
 $ Entrance Orientation     : chr [1:58] NA "se" "se" "se" ...
 $ Site Elevation           : num [1:58] NA 948 948 948 948 950 950 957 908 934 ...
 $ Site Elevation Units     : logi [1:58] NA FALSE FALSE FALSE FALSE FALSE ...
 $ Habitat Info 1           : chr [1:58] NA "natural grassland and prairie" "natural grassland and prairie" "natural grassland and prairie" ...
 $ Habitat Info 2           : chr [1:58] NA "woodland/forest" "woodland/forest" "woodland/forest" ...
 $ Habitat Info 3           : logi [1:58] NA NA NA NA NA NA ...
 $ Nest Attempts            : num [1:58] NA 2 2 2 2 1 2 NA NA NA ...
 $ Comments                 : chr [1:58] "Kestrel box installed in 2021 by Tom K" NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Site Name` = col_character(),
  ..   Latitude = col_double(),
  ..   Longitude = col_double(),
  ..   Substrate = col_character(),
  ..   `Height Above Ground` = col_double(),
  ..   `Height Above Ground Units` = col_logical(),
  ..   `Entrance Diameter` = col_double(),
  ..   `Entrance Diameter Units` = col_character(),
  ..   `Entrance Orientation` = col_character(),
  ..   `Site Elevation` = col_double(),
  ..   `Site Elevation Units` = col_logical(),
  ..   `Habitat Info 1` = col_character(),
  ..   `Habitat Info 2` = col_character(),
  ..   `Habitat Info 3` = col_logical(),
  ..   `Nest Attempts` = col_double(),
  ..   Comments = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Some of the column names include spaces, but otherwise this file is easy to read in. Well, it’s easy as long as you specify the comment = '#' argument. Rows 2-4 of the CSV file are comments.

Download Breeding Data

This is where we get the detailed breeding data. However, the download process is a bit confusing. First of all, there are two types of downloads:

  • Summaries by site
  • Individual site visits

You can also choose individual years or all years. Let’s look at each combination of choices.

Summaries by site - All years

The downloaded file is named breedingSummary-Sialia-20240221.csv.

That’s unfortunate as there is no indication that this is a summary by site nor which years are included.

That’s not the worst problem. There is no field indicating the year. For records containing at least one filled out date, we can infer the year. However, there might be records without any dates - a nesting attempt that resulted in no eggs. That means we need to download each year individually.

Summaries by site - a single year

You can pick a year from the drop down menu and click the Download Now button. The downloaded file is named breedingSummary-Sialia-20240221.csv. Yep, it’s named exactly the same no matter which year we choose or if we choose all years.

After downloading, you need to rename the file to contain the year.

This table contains one row per nest box and summarizes the results for the year chosen. It is the primary source we use in our R based data prep machinery for the breeding details.

Let’s take a quick look at this file.

breeding_stub <- "breedingSummary-Sialia"
yr <- 2024

year_str <- as.character(yr)
breeding_file <- file.path(raw_data_path, paste0(breeding_stub, '-', year_str,".csv"))
breeding_summary_df <- read_csv(file = breeding_file, col_names = TRUE, 
                                comment = '#', show_col_types = FALSE)
str(breeding_summary_df)
spc_tbl_ [71 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Site Name       : chr [1:71] "BCNP12" "BCNP11" "BCNP08" "Watershed Barn" ...
 $ Species         : chr [1:71] "Tree Swallow" "Eastern Bluebird" "Tree Swallow" "American Robin" ...
 $ Outcome         : chr [1:71] "At least one host young fledged" "At least one host young fledged" "All young found dead in or nearby nest" NA ...
 $ 1st Egg Date    : Date[1:71], format: "2024-05-06" "2024-05-08" ...
 $ 1st Hatch Date  : Date[1:71], format: "2024-05-26" "2024-05-24" ...
 $ 1st Fledge Date : Date[1:71], format: "2024-06-11" "2024-06-12" ...
 $ No. of fledged  : num [1:71] 5 4 0 NA NA 4 NA NA NA NA ...
 $ Max clutch size : num [1:71] 5 4 4 NA NA 5 NA NA NA NA ...
 $ Total live young: num [1:71] 5 4 4 NA NA 4 NA NA NA NA ...
 $ Unhatched eggs  : num [1:71] 0 0 NA NA NA 1 NA NA NA NA ...
 $ Hatch Rate      : chr [1:71] "100.0%" "100.0%" "100.0%" NA ...
 $ Fledge Rate     : chr [1:71] "100.0%" "100.0%" "0.0%" NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Site Name` = col_character(),
  ..   Species = col_character(),
  ..   Outcome = col_character(),
  ..   `1st Egg Date` = col_date(format = ""),
  ..   `1st Hatch Date` = col_date(format = ""),
  ..   `1st Fledge Date` = col_date(format = ""),
  ..   `No. of fledged` = col_double(),
  ..   `Max clutch size` = col_double(),
  ..   `Total live young` = col_double(),
  ..   `Unhatched eggs` = col_double(),
  ..   `Hatch Rate` = col_character(),
  ..   `Fledge Rate` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Individual site visits - All years

Yep, you guessed it, the file is still named breedingSummary-OT_NestWatch-20240221.csv. At least each row has a date or datetime associated with it. I haven’t used this table yet for anything. This file is the only place to get dates related to nesting attempts that never resulted in any eggs. However, we aren’t going to address this issue in this post.

Download Species Summaries

Again, these can be done for an individual year or for all years. The all years version just aggregates things over all the years - it does not give you individual year summaries. So, if we want to use these, you need to download each year separately and rename the files as you go so you know what year it is.

Species summary - a single year

The filename is speciesSummary-Sialia-20240221.csv. Again, no indication of the year chosen in the filename. Download each year separately and rename the files as you go so you know what year it is.

Several of the column names have the '#' character in it. That’s a problem. I raised this issue with the NestWatch folks a few years ago:

In the species summary, some of the column names include the ‘#’ character. This character is also used to denote comment lines and this causes problems when reading the csv using R or Python libraries where one can specify a comment characters so that readr() or read_csv() can skip these lines. With readr(), the embedded ‘#’ characters in the column headings was causing problems with proper file reading. Specifically, this:

speciesSummary_OT_NestWatch_20230221 <- read_csv("data/raw/speciesSummary-OT_NestWatch-20230221.csv", comment = "#")

results in columns 2:N being collapsed into a single column named ‘Total’.

As of the time of this post, nothing has changed.

Here are the columns in this file:

Total # nesting attempts    
Earliest Egg Date   
Earliest Hatch Date 
Earliest Fledge Date    
Total # eggs    
Total # nestlings   
Total # fledglings  
Nest attempts with at least one fledgling   
Nesting success rate    
Mean clutch size    
Mean nestlings  
Mean fledglings

Data prep for analysis

We have three different file types we need to read and do some pre-processing on before we can proceed to analysis of the data. The scenario that we’ll use is that we have downloaded the site file as well as all of the breeding summary and species summary files for 2019-2024. The breeding and species summary files have been renamed as described above so that they have the year appended to the filename - e.g. breedingSummary-Sialia-2019.csv and speciesSummary-Sialia-2019.csv. Our goal is to combine the annual files into combined breeding summary and species summary files so that we can do analysis over time. Let’s look at each of the three file types and the data pre-processing done for each. Some of these things are going to be specific to our interests and our way of doing things in our NestWatch project.

In practice, I have a single R script that does all of this but for purposes of this post, I’ll break it up into pieces.

Data prep for the site descriptions file

The first step after reading the file is to clean up the column names.

site_file <- file.path(raw_data_path, paste0(site_stub, ".csv"))
site_df <- read_csv(file = site_file, col_names = TRUE, 
                    comment = '#', show_col_types = FALSE)
site_df <- site_df %>% janitor::clean_names()

What happens in the clean_names() function? We rely on the janitor package and its clean_names() function. Some of the key things it does is:

  • Parses letter cases and separators to a consistent format (e.g. snake case is default)
  • Handles special characters and spaces, including transliterating characters like œ to oe.
  • Appends numbers to duplicated names
  • Converts “%” to “percent” and “#” to “number” to retain meaning
  • Spacing (or lack thereof) around numbers is preserved

Let’s redo the reading of the file and see the changes made.

site_df <- read_csv(file = site_file, col_names = TRUE, 
                    comment = '#', show_col_types = FALSE)
names(site_df)
 [1] "Site Name"                 "Latitude"                 
 [3] "Longitude"                 "Substrate"                
 [5] "Height Above Ground"       "Height Above Ground Units"
 [7] "Entrance Diameter"         "Entrance Diameter Units"  
 [9] "Entrance Orientation"      "Site Elevation"           
[11] "Site Elevation Units"      "Habitat Info 1"           
[13] "Habitat Info 2"            "Habitat Info 3"           
[15] "Nest Attempts"             "Comments"                 

Now, let’s clean them.

site_df <- site_df %>% janitor::clean_names()
names(site_df)
 [1] "site_name"                 "latitude"                 
 [3] "longitude"                 "substrate"                
 [5] "height_above_ground"       "height_above_ground_units"
 [7] "entrance_diameter"         "entrance_diameter_units"  
 [9] "entrance_orientation"      "site_elevation"           
[11] "site_elevation_units"      "habitat_info_1"           
[13] "habitat_info_2"            "habitat_info_3"           
[15] "nest_attempts"             "comments"                 

Now we are going to add a box_type field which takes on values of either ‘Traditional’, ‘Peterson’, or ‘Kestrel’. For more about the Peterson box, see this post about the creator of it. We use the comments field in our site description table to store information about the box type. Using R’s grep() function, we can extract the box type and populate the box_type field.

# Initialize all as Traditional
site_df$box_type <- 'Traditional'
# Find and set the Peterson boxes
site_df$box_type[grep('Peterson', site_df$comments, ignore.case = TRUE)] <- 'Peterson'

site_df$site_name_type <- str_c(site_df$site_name, '-', str_sub(site_df$box_type, 1, 1))

# Find and set the Kestrel boxes
site_df$box_type[grep('Kestrel', site_df$comments, ignore.case = TRUE)] <- 'Kestrel'
site_df$box_type <- as.factor(site_df$box_type)

How many do we have of each?

site_df |> 
  group_by(box_type) |> 
  count()
# A tibble: 3 × 2
# Groups:   box_type [3]
  box_type        n
  <fct>       <int>
1 Kestrel         2
2 Peterson        9
3 Traditional    47

We use a box naming convention that starts with an abbreviation for the park and then a box number.

unique(site_df$site_name)[c(2:5, 11:14)]
[1] "BCNP07" "BCNP08" "BCNP09" "BCNP10" "PCT01"  "PCT02"  "PCT04"  "PCT03" 

We’ll use the park abbreviation for a new field called area.

# Add area
site_df$area <- str_extract(site_df$site_name, '^[A-Z]+')

site_df |> 
  group_by(area) |> 
  count()
# A tibble: 7 × 2
# Groups:   area [7]
  area      n
  <chr> <int>
1 BCNP     12
2 CIP      24
3 DTLP      9
4 G         1
5 K         1
6 P         1
7 PCT      10

Finally, we convert site_df to a simple features object using the sf package. While we have longitude and latitude fields in our dataframe, site_df is not really a spatially aware object. In order to make it easy to do spatial related queries, computations and visualizations, we need to fortify it with some spatial powers. To do this we’ll use the sf package.

A package that provides simple features access for R.

With sf we can represent simple planar features such as points, lines and polygons (and multi versions of these things) and manipulate them in dataframes or tibbles. In the Python world we could use the GeoPandas package (which relies on the shapely package). Simple features is a general concept supported in many geocomputational tools including things like QGIS and PostGIS.. See my earlier post on creating maps of the nest box locations with R, for more details on geocomputation in R.

library(sf)

# Convert site_df to an simple features object
site_sf <- st_as_sf(x = site_df, 
                    coords = c("longitude", "latitude"),
                    crs = "EPSG:4326")

We will need this spatial awareness later as part of the analysis.

Data prep for the species summary file

This file is a little trickier due to the use of the '#' character within several column names. We need to fix these first before attempting to use read_csv() to ingest the file. I created a function that pulls out the first line from the species summary file (the header line) and uses janitor::make_clean_names() to return a cleaned up version. Then we can use the cleaned up names when reading in the species summary file. We saw that the clean_names() function takes a dataframe as an input and modifies the column names. The make_clean_names() function takes a character vector of names as an input and returns a cleaned up character vector of names. As you might guess, clean_names() uses make_clean_names() under the hood.

fix_species_colnames <- function(src_path){
  
  # Convert cleaned line to character vector of new column names
  original_col_names <- stringr::str_split_1(readLines(src_path, 1), ",")
  
  # Now clean names (janitor handles the '#' replacement with '_number_')
  clean_col_names <- janitor::make_clean_names(original_col_names)
  
}

The main part of code loops over the years that are to be read in. At each iteration, the species summary CSV file is read in and stored in a list. This list of dataframes is then combined rowwise to create the final multi-year species summary dataframe.

species_stub <- "speciesSummary-Sialia"

min_year <- 2019
max_year <- 2024
year_range <- min_year:max_year

# Clean the species file column names and read into dataframes
species_summary_dfs <- list()
for (yr in year_range) {
  
  year_str <- as.character(yr)
  species_file <- file.path(raw_data_path, paste0(species_stub, '-', year_str,".csv"))
  
  # Get new column names
  cleaned_col_names <- fix_species_colnames(species_file)
  
  # Read in file, skipping first line
  species_summary_dfs[[year_str]] <- read_csv(file = species_file, skip = 1, 
                                              col_names = FALSE, comment = '#',
                                              show_col_types = FALSE)
  
  # Reset the column names
  names(species_summary_dfs[[year_str]]) <- cleaned_col_names
  species_summary_dfs[[year_str]]$year <- yr
  
}
# Combine list of dataframes into one dataframe.
species_summary_df <- bind_rows(species_summary_dfs)
rm(species_summary_dfs)
str(species_summary_df)
spc_tbl_ [24 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ species                                  : chr [1:24] "Black-capped Chickadee" "Tree Swallow" "House Wren" "Eastern Bluebird" ...
 $ total_number_nesting_attempts            : num [1:24] 2 22 4 27 1 19 2 32 3 1 ...
 $ earliest_egg_date                        : Date[1:24], format: "2019-05-02" "2019-05-06" ...
 $ earliest_hatch_date                      : Date[1:24], format: "2019-05-22" "2019-05-24" ...
 $ earliest_fledge_date                     : Date[1:24], format: "2019-06-05" "2019-06-16" ...
 $ total_number_eggs                        : num [1:24] 8 77 10 89 4 75 3 118 0 6 ...
 $ total_number_nestlings                   : num [1:24] 8 42 9 79 4 58 3 104 0 0 ...
 $ total_number_fledglings                  : num [1:24] NA 38 9 78 4 57 3 103 0 0 ...
 $ nest_attempts_with_at_least_one_fledgling: num [1:24] 0 10 2 20 1 11 1 27 0 0 ...
 $ nesting_success_rate                     : chr [1:24] "0.0%" "45.5%" "50.0%" "74.1%" ...
 $ mean_clutch_size                         : num [1:24] 8 4.81 5 4.05 4 ...
 $ mean_nestlings                           : num [1:24] 8 3.23 4.5 3.59 4 ...
 $ mean_fledglings                          : num [1:24] NA 2.92 4.5 3.55 4 ...
 $ year                                     : int [1:24] 2019 2019 2019 2019 2019 2020 2020 2020 2020 2021 ...
 - attr(*, "spec")=
  .. cols(
  ..   X1 = col_character(),
  ..   X2 = col_double(),
  ..   X3 = col_date(format = ""),
  ..   X4 = col_date(format = ""),
  ..   X5 = col_date(format = ""),
  ..   X6 = col_double(),
  ..   X7 = col_double(),
  ..   X8 = col_double(),
  ..   X9 = col_double(),
  ..   X10 = col_character(),
  ..   X11 = col_double(),
  ..   X12 = col_double(),
  ..   X13 = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Data prep for the breeding summary file

There’s a little more work to do with this file than the other two. It’s the main source of detailed data needed for our analysis. We start, much like we just did with the species summaries, by reading in each year specific file and combining them into one breeding summary dataframe.

# Create list of dataframes - one df for each year
breeding_summary_dfs <- list()
for (yr in year_range){
  year_str = as.character(yr)
  breeding_file <- file.path(raw_data_path, paste0(breeding_stub, '-', year_str,".csv"))
  breeding_summary_dfs[[year_str]] <- read_csv(file = breeding_file, 
                                               col_names = TRUE, comment = '#',
                                               show_col_types = FALSE)
}
# Combine list of dataframes into one dataframe.
breeding_summary_df <- bind_rows(breeding_summary_dfs, .id = "site_year")

Then we do some changing of data types, cleaning column names, adding a new column and replacing NA values with 0’s.

# Change site_year to numeric
breeding_summary_df$site_year <- as.numeric(breeding_summary_df$site_year)
# Clean up the column names
breeding_summary_df <- breeding_summary_df %>% janitor::clean_names()
# Add area
breeding_summary_df$area <- str_extract(breeding_summary_df$site_name, '^[A-Z]+')

# Replace NA with 0 in some of the numeric columns
breeding_summary_df$no_of_fledged <- 
  replace_na(breeding_summary_df$no_of_fledged, 0)
breeding_summary_df$max_clutch_size <- 
  replace_na(breeding_summary_df$max_clutch_size, 0)
breeding_summary_df$total_live_young <- 
  replace_na(breeding_summary_df$total_live_young, 0)
breeding_summary_df$unhatched_eggs <- 
  replace_na(breeding_summary_df$unhatched_eggs, 0)

Then we do computation of hatch and fledge rates.

# Compute hatch rate and fledge rate (can compare to eBird's reported rates which are strings with %)
# ISSUE: max_clutch_size seems like it should equal total_live_young + unhatched_eggs. While it usually does,
# there are cases where this is not true. In such cases, which should be used for the denominator when computing
# hatch_rate? It seems that max_clutch_size is the correct denominator.

breeding_summary_df$hatch_rate_computed <- 
  breeding_summary_df$total_live_young / breeding_summary_df$max_clutch_size
breeding_summary_df$fledge_rate_computed <- 
  breeding_summary_df$no_of_fledged / breeding_summary_df$total_live_young 

Since many boxes see more than one nesting event per season, I thought it would be helpful to create a nesting event sequence number within each year for each nest box. As a first step, I wanted a datetime field that was some sort of proxy for the nesting event. Taking the minimum of the first egg date, the first hatch date and the first fledge date gives us such a datetime for those nesting attempts resulting in at least one egg. However, this misses all those failed nesting attempts. It seems that the only way to get at those is through the Site Visits table. Even without looking at that table, we can still tell there was a nesting event because all three first date fields have NA values. We just can’t sequence it. That will be a problem for another day. We are mostly focusing on true nesting attempts defined by at least one egg being documented.

# Start by computing min date of the three date fields so that we have
# a single date field to use for sequencing if needed. 
breeding_summary_df <- breeding_summary_df %>% 
  rowwise() %>% 
  mutate(seq_date = min(c(x1st_egg_date, x1st_hatch_date, x1st_fledge_date), na.rm = TRUE))

Finally, we can clean up and save an .RData file containing the three main dataframes we just created.

# Remove the list
rm(breeding_summary_dfs)

# Save all data for research
filename <- str_interp('data/nestwatch_sialia_${min_year}_${max_year}.RData')
save(breeding_summary_df, site_sf, species_summary_df, file=filename)

Basic summary analysis

We are finally ready to do some basic analysis.

load('data/nestwatch_sialia_2019_2024.RData')

How many boxes were monitored each year?

Let’s start with how many physical boxes we have of each type in the four main parks monitored. Not all of these boxes are active, they are just in the data table. For any query involving site_sf, we use the st_drop_geometry() function to prevent the geometry from showing up in our nice output tables.

site_sf |> 
  st_drop_geometry() |> 
  filter(area %in% c('BCNP', 'CIP', 'DTLP', 'PCT')) |> 
  group_by(area, box_type) |> 
  summarize(num_boxes = n(), .groups="rowwise") |>
  pivot_wider(names_from = box_type, values_from = num_boxes) %>% 
  replace(is.na(.), 0) |> 
  gt()
area Peterson Traditional Kestrel
BCNP 3 9 0
CIP 6 17 1
DTLP 0 9 0
PCT 0 10 0

What about the number of monitored box events each year? It is possible that a single box can have more than one nesting event in a season.

breeding_summary_df |> 
  filter(area %in% c('BCNP', 'CIP', 'DTLP', 'PCT')) |> 
  inner_join(site_sf, by=c("site_name")) |>
  st_drop_geometry() |> 
  group_by(site_year, box_type) |> 
  summarize(num_boxes = n(), .groups="rowwise") |> 
  rename(`Box type` = box_type) |> 
  ggplot(aes(x=as.factor(site_year), y=num_boxes, fill=`Box type`, label=num_boxes)) + 
  geom_bar(stat="identity") + 
  geom_text(size = 6, position = position_stack(vjust = 0.5)) +
  xlab("") + ylab("Number of nest box events") +
  ggtitle("Number of nest box events by year")

How many eggs, live births, and fledging events?

We will compute these by year and by species.

# By year
annual_counts <- breeding_summary_df |> 
  filter(species != "House Sparrow", species != "American Goldfinch") |> 
  filter(area %in% c('BCNP', 'CIP', 'DTLP', 'PCT')) |> 
  group_by(site_year) |> 
  summarize(.groups="rowwise", num_boxes = n(),
            tot_fledged = sum(no_of_fledged, na.rm = TRUE),
            tot_clutch_size = sum(max_clutch_size, na.rm = TRUE),
            tot_live_young = sum(total_live_young, na.rm = TRUE),
            tot_unhatched_eggs = sum(unhatched_eggs, na.rm = TRUE),
            overall_hatch_rate_1 = tot_live_young / (tot_live_young + tot_unhatched_eggs),
            overall_hatch_rate_2 = tot_live_young / tot_clutch_size,
            overall_fledge_rate = tot_fledged / tot_live_young)

# By year by species
annual_species_counts <- breeding_summary_df |> 
  filter(species != "House Sparrow", species != "American Goldfinch") |>
  group_by(site_year, species) |> 
  summarize(.groups="rowwise", num_boxes = n(),
            tot_fledged = sum(no_of_fledged, na.rm = TRUE),
            tot_clutch_size = sum(max_clutch_size, na.rm = TRUE),
            tot_live_young = sum(total_live_young, na.rm = TRUE),
            tot_unhatched_eggs = sum(unhatched_eggs, na.rm = TRUE),
            overall_hatch_rate_1 = tot_live_young / (tot_live_young + tot_unhatched_eggs),
            overall_hatch_rate_2 = tot_live_young / tot_clutch_size,
            overall_fledge_rate = tot_fledged / tot_live_young)

# NOTE: Using total clutch size in denominator for hatch rate until can
# confirm from Cornell how exactly to compute hatch rate.
annual_counts_nice <- breeding_summary_df |>
  filter(species != "House Sparrow", species != "American Goldfinch") |>
  group_by(site_year) |> 
  summarize(.groups="rowwise", `Nesting events` = n(),
            `Total Fledged` = sum(no_of_fledged, na.rm = TRUE),
            `Total Clutch` = sum(max_clutch_size, na.rm = TRUE),
            `Total Live Young` = sum(total_live_young, na.rm = TRUE),
            `Total Unhatched Eggs` = sum(unhatched_eggs, na.rm = TRUE),
            `Overall Hatch Rate` = `Total Live Young` / (`Total Clutch`),
            `Overall Fledge Rate` = `Total Fledged` / `Total Live Young`)

# Long version
annual_counts_long <- 
  pivot_longer(annual_counts_nice, -site_year, names_to = "indicator")

Now we can use these computed dataframes to drive plots and tables.

annual_counts_long |> 
  pivot_wider(id_cols = indicator, names_from = site_year, values_from = value) |> 
  gt() |> 
  tab_header(
    title = "Overall annual statistics",
    subtitle = "All species included"
  ) |> 
  fmt_number(columns = starts_with("2"), decimals = 2,
             drop_trailing_zeros = TRUE, drop_trailing_dec_mark = TRUE) 
Overall annual statistics
All species included
indicator 2019 2020 2021 2022 2023 2024
Nesting events 55 53 55 52 52 51
Total Fledged 125 163 147 159 140 155
Total Clutch 184 196 190 177 192 205
Total Live Young 138 165 153 166 168 173
Total Unhatched Eggs 18 31 36 17 24 26
Overall Hatch Rate 0.75 0.84 0.81 0.94 0.88 0.84
Overall Fledge Rate 0.91 0.99 0.96 0.96 0.83 0.9
annual_counts_long |> 
  filter(indicator %in% c("Total Fledged", "Total Clutch", "Total Live Young")) |> 
  mutate(indicator = factor(indicator, levels = c('Total Clutch', 'Total Live Young', 'Total Fledged'))) |> 
  ggplot() + 
  geom_line(aes(x = site_year, y = value, 
                colour = indicator)) +
  labs(x = "Year", y = "Total", title = "Total Clutch, Live Young, and Fledged")

Let’s look at number fledged and number of live young by species.

Fledged by species

annual_species_counts |> 
  filter(species != "House Sparrow", species != "American Goldfinch", !is.na(species)) |> 
  group_by(site_year, species) |> 
  summarize(.groups="rowwise", Fledged = sum(tot_fledged)) |> 
  pivot_wider(id_cols = species, names_from = site_year, values_from = Fledged) %>% 
  replace(is.na(.), 0) |> 
  gt() |> 
  tab_header(
    title = "Total Fledged",
    #subtitle = str_glue("{min_year} to {max_year}")
  )
Total Fledged
species 2019 2020 2021 2022 2023 2024
Black-capped Chickadee 0 0 0 0 0 0
Eastern Bluebird 78 103 90 104 90 76
House Wren 9 3 7 4 0 0
Tree Swallow 38 57 50 51 44 79
Northern House Wren 0 0 0 0 6 0
American Robin 0 0 0 0 0 0
annual_species_counts |> 
  filter(species %in% c("Eastern Bluebird", "Tree Swallow")) |> 
  ggplot() +
  geom_line(aes(x = site_year, y = tot_fledged, colour = species)) +
  labs(x = "Year", y = "Total fledged", title = "Total fledged by species")

Live young by species

annual_species_counts |> 
  filter(species != "House Sparrow", species != "American Goldfinch", !is.na(species)) |> 
  group_by(site_year, species) |> 
  summarize(.groups="rowwise", live_young = sum(tot_live_young)) |> 
  pivot_wider(id_cols = species, names_from = site_year, values_from = live_young) %>% 
  replace(is.na(.), 0) |>
  gt() |> 
  tab_header(
    title = "Total Live Young",
    #subtitle = str_glue("{min_year} to {max_year}")
  )
Total Live Young
species 2019 2020 2021 2022 2023 2024
Black-capped Chickadee 8 0 0 0 0 0
Eastern Bluebird 79 104 91 110 112 86
House Wren 9 3 10 4 0 0
Tree Swallow 42 58 52 52 50 87
Northern House Wren 0 0 0 0 6 0
American Robin 0 0 0 0 0 0
annual_species_counts |> 
  filter(species %in% c("Eastern Bluebird", "Tree Swallow")) |> 
  ggplot() +
  geom_line(aes(x = site_year, y = tot_live_young, colour = species)) +
  labs(x = "Year", y = "Total live young", title = "Total live young by species")

Hatch rate

The hatch rate is the total number of live young divided by the max clutch size observed during the nesting event.

annual_species_counts |> 
  filter(species == "Eastern Bluebird" | species == "Tree Swallow" | species == "House Wren") |> 
  group_by(site_year, species) |> 
  summarize(.groups="rowwise", hatch_rate = sum(overall_hatch_rate_2)) |> 
  pivot_wider(id_cols = species, names_from = site_year, values_from = hatch_rate) %>% 
  replace(is.na(.), 0) |>
  gt() |> 
  tab_header(
    title = "Total Hatch Rate",
    #subtitle = str_glue("{min_year} to {max_year}")
  ) |> 
  fmt_number(columns = starts_with("2"), decimals = 2)
Total Hatch Rate
species 2019 2020 2021 2022 2023 2024
Eastern Bluebird 0.89 0.88 0.88 0.99 0.85 0.84
House Wren 0.90 1.00 0.71 Inf 0.00 0.00
Tree Swallow 0.55 0.77 0.79 0.79 0.91 0.93
annual_species_counts |> 
  filter(species %in% c("Eastern Bluebird", "Tree Swallow")) |> 
  ggplot() +
  geom_line(aes(x = site_year, y = overall_hatch_rate_2, colour = species)) +
  labs(x = "Year", y = "Hatch rate", title = "Hatch rate by species")

Fledge rate

The fledge rate is the total number of birds fledged divided by the total number of live young.

annual_species_counts |> 
  filter(species == "Eastern Bluebird" | species == "Tree Swallow" | species == "House Wren") |> 
  group_by(site_year, species) |> 
  summarize(.groups="rowwise", fledge_rate = sum(overall_fledge_rate)) |> 
  pivot_wider(id_cols = species, names_from = site_year, values_from = fledge_rate) |> 
  gt() |> 
  tab_header(
    title = "Total Fledge Rate",
    #subtitle = str_glue("{min_year} to {max_year}")
  ) |> 
  fmt_number(columns = starts_with("2"), decimals = 2)
Total Fledge Rate
species 2019 2020 2021 2022 2023 2024
Eastern Bluebird 0.99 0.99 0.99 0.95 0.80 0.88
House Wren 1.00 1.00 0.70 1.00 NA NA
Tree Swallow 0.90 0.98 0.96 0.98 0.88 0.91
annual_species_counts |> 
  filter(species %in% c("Eastern Bluebird", "Tree Swallow")) |> 
  ggplot() +
  geom_line(aes(x = site_year, y = overall_fledge_rate, colour = species)) +
  labs(x = "Year", y = "Fledge rate", title = "Fledge rate by species")

We’ll stop here for now. In the next installment we’ll look to see if there are differences in nesting success in Peterson boxes versus the traditional boxes.

Citation

BibTeX citation:
@online{isken2025,
  author = {Isken, Mark},
  title = {Analyzing {NestWatch} Data with {R}},
  date = {2025-03-01},
  url = {https://bitsofanalytics.org//posts/nestwatch-data-analysis/analyzing_nestwatch_data_r.html},
  langid = {en}
}
For attribution, please cite this work as:
Isken, Mark. 2025. “Analyzing NestWatch Data with R.” March 1, 2025. https://bitsofanalytics.org//posts/nestwatch-data-analysis/analyzing_nestwatch_data_r.html.