Comparing predictive models for obstetrical unit occupancy using caret - Part 3
Put it all together
In Part 1 of this series of posts, we used caret
to train and test a few simulation metamodels using data from a series of simulation experiments on a simplified obstetrical patient flow network. In doing so, a number of questions popped up which suggested we needed a way to easily automate a number of partition, train, predict, summarize cycles with caret
.
In Part 2 we created a simple function named caret_ptps
to facilitate automating the process of partitioning a dataset, training a model, making predictions and summarizing the results. This function takes a number of arguments related to the dataframe to use, a model formula, a modeling method, and values to control the model training and testing phases.
<- function(model_formula, data, method, scenario="",
caret_ptps pct_train_val=0.6, partition.times=1,
control=trainControl(method="None"),
seed.partition=NULL, seed.resample=NULL,
...){
# Container list for results
<- length(pct_train_val) * partition.times
listsize <- vector("list", listsize)
results_for_df <- vector("list", listsize)
results_objs
# Get y variable name from formula object. The ~ is [[1]], LHS is [[2]] and RHS[[3]].
<- deparse(model_formula[[2]])
y.name
# Outer loop for training set size
<- 0
itemnum for (pct_train in pct_train_val){
# Create partitions for this pct_train.
<- createDataPartition(data[[y.name]], p = pct_train,
trainrecs list = FALSE, times = partition.times)
# Loop over data partitions
for (s in seq(1,partition.times)){
<- data[-trainrecs[,s], ]
modeling_test_df <- data[trainrecs[,s], ]
modeling_train_df
# Train the model
<- train(model_formula,
fit data = modeling_train_df,
method = method,
trControl = control,
...)
# Make predictions for the test set
<- predict(fit, newdata=modeling_test_df)
pred
# Gather actual and predicted values
<- fit$trainingData$.outcome
train_y_actual <- fitted.values(fit)
train_y_pred
<- modeling_test_df[, y.name]
test_y_actual <- pred
test_y_pred
# Compute RMSE for train and test
<- rmse(train_y_actual, train_y_pred)
rmse_train <- rmse(test_y_actual, test_y_pred)
rmse_test
# Store key items in lists for later processing
<- itemnum + 1
itemnum
<- list(ptrain = pct_train, sample = s, scenario = scenario,
result_for_df rmse_train = rmse_train,
rmse_test = rmse_test)
<- result_for_df
results_for_df[[itemnum]]
<- list(test_y_actual = test_y_actual,
result_objs test_y_pred = test_y_pred,
trained_model = fit)
<- result_objs
results_objs[[itemnum]]
}
}
<- list(rmse=results_for_df, fitpred=results_objs)
results_list return(results_list)
}
Now we are ready to use this function as we iterate over a dataframe of scenarios to analyze with caret
. Part 1 ended with:
Just this little bit of copy, paste, editing was frought with errors and tediously repititious. > > Clearly we need to encapsulate this process in one or more functions. Ideally I also wanted to be able to create analysis scenarios defined by a bunch of attributes that I could store in a metadata file. To make this concrete, here’s a little csv file illustrating what I wanted to do.
<- read.csv(file="data/scenarios_test.csv")
scenarios_test_df ::kable(scenarios_test_df) knitr
scen | model_formula | data | method | scenario | pct_train_val | partition.times | control | seed.partition | seed.resample | extra_args |
---|---|---|---|---|---|---|---|---|---|---|
1 | occmean_pp ~ load_pp | obsim_df | lm | occmean_pp_load1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 | |
2 | occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate | obsim_df | lmStepAIC | occmean_pp_lmstep1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 | |
3 | occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate | obsim_df | nnet | occmean_pp_nnet1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 | linout=TRUE |
4 | occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate | obsim_df | rf | occmean_pp_rf1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 | |
5 | occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate | obsim_df | knn | occmean_pp_knn1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 | preProcess=c(‘center’, ‘scale’) |
6 | occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate | obsim_df | gamSpline | occmean_pp_spline1 | seq(0.5,0.9,0.1) | 10 | fitControl | 231 | 37 |
Obviously, when we read this csv file into an R dataframe, we are going to end up with some numeric fields and some string fields. Well, the string fields will come in as factors by default but we can use the stringsAsFactors=FALSE
argument to prevent this.
<- read.csv(file="data/scenarios_test.csv",
scenarios_test_df stringsAsFactors = FALSE)
str(scenarios_test_df)
## 'data.frame': 6 obs. of 11 variables:
## $ scen : int 1 2 3 4 5 6
## $ model_formula : chr "occmean_pp ~ load_pp" "occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate" "occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate" "occmean_pp ~ lam_pp + alos_pp + cap_pp + tot_c_rate" ...
## $ data : chr "obsim_df" "obsim_df" "obsim_df" "obsim_df" ...
## $ method : chr "lm" "lmStepAIC" "nnet" "rf" ...
## $ scenario : chr "occmean_pp_load1" "occmean_pp_lmstep1" "occmean_pp_nnet1" "occmean_pp_rf1" ...
## $ pct_train_val : chr "seq(0.5,0.9,0.1)" "seq(0.5,0.9,0.1)" "seq(0.5,0.9,0.1)" "seq(0.5,0.9,0.1)" ...
## $ partition.times: int 10 10 10 10 10 10
## $ control : chr "fitControl" "fitControl" "fitControl" "fitControl" ...
## $ seed.partition : int 231 231 231 231 231 231
## $ seed.resample : int 37 37 37 37 37 37
## $ extra_args : chr "" "" "linout=TRUE" "" ...
One of our primary challenges will be to use string representations or names of things like formulas, dataframes, control objects and R expressions within R code to get at the actual underlying objects. Let’s start by considering a few specific cases.
First, let’s read in the dataset needed for the rest of this post.
<- read.csv(file="data/obsim_example.csv")
obsim_df names(obsim_df)[1] <- "scenario"
# Shorten a few column names
names(obsim_df)[13] <- "occmean_ldr"
names(obsim_df)[14] <- "occp95_ldr"
names(obsim_df)[15] <- "occmean_pp"
names(obsim_df)[16] <- "occp95_pp"
Load some libraries.
library(ggplot2)
library(dplyr)
library(Metrics)
library(caret)
R formulas to text and back
The scenarios_test_df
dataframe has text values that look like R formula objects.
<- scenarios_test_df$model_formula[1]
simple_formula_str simple_formula_str
## [1] "occmean_pp ~ load_pp"
class(simple_formula_str)
## [1] "character"
An R formula object is not a string. It’s an R formula object.
<- occmean_pp ~ load_pp
simple_formula simple_formula
## occmean_pp ~ load_pp
## <environment: 0x1640188>
class(simple_formula)
## [1] "formula"
Many R functions take formula objects as arguments. For example, lm
:
lm(simple_formula, data = obsim_df)
##
## Call:
## lm(formula = simple_formula, data = obsim_df)
##
## Coefficients:
## (Intercept) load_pp
## 0.01363 1.00295
As you might have guessed, lm
probably will do the right thing if we pass in a string that can be coerced to a valid R formula.
lm(simple_formula_str, data = obsim_df)
##
## Call:
## lm(formula = simple_formula_str, data = obsim_df)
##
## Coefficients:
## (Intercept) load_pp
## 0.01363 1.00295
Great. However, we are using caret
which supports over 200 different modeling techniques. Will caret
work if its train
function is given a formula string instead of the formula it expects?
First we need to partition the dataset and setup our training control object.
<- 233
train_test_seed <- 67
resample_seed
<- 125/150
pct_train set.seed(train_test_seed)
<- createDataPartition(obsim_df$occmean_pp, p = pct_train,
trainrecs list = FALSE, times = 1)
<- obsim_df[trainrecs, ]
obsim_train_df <- obsim_df[-trainrecs, ]
obsim_test_df
<- trainControl(## 5-fold CV
fitControl method = "repeatedcv",
number = 5,
## repeated ten times
repeats = 10)
First we’ll explicitly pass in a formula object to make sure everything is working fine.
set.seed(resample_seed)
<- train(occmean_pp ~ load_pp, data = obsim_train_df,
train_load1 method = "lm",
trControl = fitControl)
coef(summary(train_load1))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01324359 0.0067156390 1.972052 0.05082951
## load_pp 1.00299005 0.0001896324 5289.128771 0.00000000
Now confirm that we can pass in a formula object variable. Of course we can.
set.seed(resample_seed)
<- train(simple_formula, data = obsim_train_df,
train_load1 method = "lm",
trControl = fitControl)
coef(summary(train_load1))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01324359 0.0067156390 1.972052 0.05082951
## load_pp 1.00299005 0.0001896324 5289.128771 0.00000000
What if we pass in the string version of the formula?
set.seed(resample_seed)
<- train(simple_formula_str, data = obsim_train_df,
train_load1_str method = "lm",
trControl = fitControl)
## Error in train.default(simple_formula_str, data = obsim_train_df, method = "lm", : argument "y" is missing, with no default
Does not work.
Turns out to be simple to be simple to create a formula object from a string using R’s as.formula
function.
<- as.formula(simple_formula_str)
simple_formula_from_str class(simple_formula_from_str)
## [1] "formula"
set.seed(resample_seed)
<- train(simple_formula_from_str, data = obsim_train_df,
train_load1 method = "lm",
trControl = fitControl)
coef(summary(train_load1))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01324359 0.0067156390 1.972052 0.05082951
## load_pp 1.00299005 0.0001896324 5289.128771 0.00000000
Success. What about the other direction: formula –> string. While that feels like it should be simple, there are complications. Let’s try the obvious thing, the as.character
function.
as.character(simple_formula)
## [1] "~" "occmean_pp" "load_pp"
Looks like we get a vector of strings. Interesting. The left hand side of the equation is the second element and the right hand side, the third. That will come in handy later when we need to pluck the response variable name from a formula. Doesn’t solve the immediate problem, but useful.
A little searching led to the R deparse
function. For example, see https://stackoverflow.com/questions/14671172/how-to-convert-r-formula-to-text.
deparse(simple_formula)
## [1] "occmean_pp ~ load_pp"
You do need to be a little careful in that the default behavior for deparse
will start inserting line breaks after 60 characters. No problem, we can specify a larger limit before line breaking using the width.cutoff
argument. There’s also an nlines
argument that specifies the maximum number of lines to produce.
deparse(simple_formula, width.cutoff = 500, nlines = 1)
## [1] "occmean_pp ~ load_pp"
R dataframes from string version of their variable name
The scenarios_test_df
dataframe has a column containing the name of the dataframe we want to use for our caret
work.
<- scenarios_test_df$data[1]
dfname_str dfname_str
## [1] "obsim_df"
Let’s play a similar game with lm
as we did with the formula.
lm(simple_formula, data=dfname_str)
## Error in eval(predvars, data, env): invalid 'envir' argument of type 'character'
Not that we really expected R to coerce a string that just happens to coincide with a dataframe name in the current environment, but worth a try. Instead, we must use the get
function.
lm(simple_formula, data=get(dfname_str))
##
## Call:
## lm(formula = simple_formula, data = get(dfname_str))
##
## Coefficients:
## (Intercept) load_pp
## 0.01363 1.00295
R expressions from strings
One of the things we want to experiment with is the percentage of cases to allocate to the training data set. In the example here, we see that the pct_train_val
column contains the following string representing a sequence.
$pct_train_val[1] scenarios_test_df
## [1] "seq(0.5,0.9,0.1)"
So, how can we generate the actual sequence from this string?
seq(0.5,0.9,0.1)
## [1] 0.5 0.6 0.7 0.8 0.9
One approach, though seemingly frowned upon, relies on a combination of the parse
and eval
functions. We use parse
to create an R expression from a string and eval
to evaluate it. Note that parse
is usually used with files and so we need to use the text=
argument to give it a string to parse.
<- "seq(0.5,0.9,0.1)"
seq_str parse(text = seq_str)
## expression(seq(0.5, 0.9, 0.1))
Now, evaluate it.
eval(parse(text = seq_str))
## [1] 0.5 0.6 0.7 0.8 0.9
See https://stackoverflow.com/questions/7836972/use-character-string-as-function-argument for some discussion of this. One of the answers includes this gem:
library(fortunes)
fortune(106)
##
## If the answer is parse() you should usually rethink the question.
## -- Thomas Lumley
## R-help (February 2005)
Calling functions with list of arguments
Now we want to call the caret_ptps
function using the arguments stored in a row of the dataframe scenarios_test_df
. Consider the first scenario.
print(scenarios_test_df[1,])
## scen model_formula data method scenario
## 1 1 occmean_pp ~ load_pp obsim_df lm occmean_pp_load1
## pct_train_val partition.times control seed.partition seed.resample
## 1 seq(0.5,0.9,0.1) 10 fitControl 231 37
## extra_args
## 1
Let’s put these values from the first scenario row into a set of variables. For some of the fields we’ll use one of the techniques discussed above to convert it from a string to an R object or R expression.
# Grab first row
<- scenarios_test_df[1,]
argdf
<- as.formula(argdf$model_formula)
model_formula <- get(argdf$data)
data <- argdf$method
method <- argdf$scenario
scenario <- eval(parse(text=argdf$pct_train_val))
pct_train_val <- argdf$partition.times
partition.times <- get(argdf$control)
control <- argdf$seed.partition
seed.partition <- argdf$seed.resample
seed.resample
# For now, extra_args is left as a string. We'll deal with this complication later in
# this post.
<- ifelse(length(argdf$extra_args)>0, argdf$extra_args, NULL) extra_args
Now we can use the R function do.call
to call caret_ptps
with a list of named elements corresponding to the function arguments. Pretty powerful. See the documentation for more info. Start by creating a list containing the arguments to pass.
<- list(model_formula=model_formula, data=data, method=method, scenario=scenario,
arglist pct_train_val=pct_train_val, partition.times=partition.times,
control=control, seed.partition=seed.partition,
seed.resample=seed.resample)
str(arglist)
## List of 9
## $ model_formula :Class 'formula' language occmean_pp ~ load_pp
## .. ..- attr(*, ".Environment")=<environment: 0x1640188>
## $ data :'data.frame': 150 obs. of 16 variables:
## ..$ scenario : int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ lam_ldr : num [1:150] 2.74 2.74 2.74 2.74 2.74 ...
## ..$ alos_ldr : num [1:150] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## ..$ cap_ldr : int [1:150] 4 4 4 3 3 3 2 2 2 8 ...
## ..$ load_ldr : num [1:150] 1.37 1.37 1.37 1.37 1.37 ...
## ..$ rho_ldr : num [1:150] 0.342 0.342 0.342 0.457 0.457 ...
## ..$ lam_pp : num [1:150] 2.74 2.74 2.74 2.74 2.74 ...
## ..$ alos_pp : num [1:150] 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 2.2 ...
## ..$ cap_pp : int [1:150] 12 10 9 12 10 9 12 10 9 29 ...
## ..$ load_pp : num [1:150] 6.03 6.03 6.03 6.03 6.03 ...
## ..$ rho_pp : num [1:150] 0.502 0.603 0.67 0.502 0.603 ...
## ..$ tot_c_rate : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 ...
## ..$ occmean_ldr: num [1:150] 1.37 1.43 1.52 1.32 1.36 ...
## ..$ occp95_ldr : num [1:150] 3.68 3.99 4 3 3 ...
## ..$ occmean_pp : num [1:150] 6.05 6.02 6.02 6.02 6.03 ...
## ..$ occp95_pp : num [1:150] 10.2 10 9 10.2 10 ...
## $ method : chr "lm"
## $ scenario : chr "occmean_pp_load1"
## $ pct_train_val : num [1:5] 0.5 0.6 0.7 0.8 0.9
## $ partition.times: int 10
## $ control :List of 26
## ..$ method : chr "repeatedcv"
## ..$ number : num 5
## ..$ repeats : num 10
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : logi FALSE
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 3
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp: num 3
## .. ..$ k : num 5
## ..$ sampling : NULL
## ..$ index : NULL
## ..$ indexOut : NULL
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds : logi NA
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## $ seed.partition : int 231
## $ seed.resample : int 37
Now we can call caret_ptps
.
<- do.call(caret_ptps, arglist) results_scenario1
Pull out the first element and convert to a dataframe. Display the head and tail of the results summary.
<- results_scenario1[[1]]
results_for_df <- do.call(rbind,lapply(results_for_df, data.frame)) results_df
print(head(results_df))
## ptrain sample scenario rmse_train rmse_test
## 1 0.5 1 occmean_pp_load1 0.03235569 0.04107803
## 2 0.5 2 occmean_pp_load1 0.03145381 0.04147086
## 3 0.5 3 occmean_pp_load1 0.03713965 0.03612767
## 4 0.5 4 occmean_pp_load1 0.03503569 0.03825939
## 5 0.5 5 occmean_pp_load1 0.03663228 0.03691576
## 6 0.5 6 occmean_pp_load1 0.03609778 0.03719239
print(tail(results_df))
## ptrain sample scenario rmse_train rmse_test
## 45 0.9 5 occmean_pp_load1 0.03607652 0.04242421
## 46 0.9 6 occmean_pp_load1 0.03713654 0.02999180
## 47 0.9 7 occmean_pp_load1 0.03656720 0.03718645
## 48 0.9 8 occmean_pp_load1 0.03673756 0.03520638
## 49 0.9 9 occmean_pp_load1 0.03587340 0.04446830
## 50 0.9 10 occmean_pp_load1 0.03788447 0.01594970
Here’s a quick box plot of the test RMSE distribution by the percentage of cases allocated to the training data.
ggplot(data=results_df) + geom_boxplot(aes(x=as.factor(ptrain), y=rmse_test), colour="tan2") + ggtitle(scenario)
For convenience, I created a helper function to create a list of named arguments from a single row of a dataframe. At the bottom of the function you’ll see that it includes code to deal with one extra argument to be passed on to the caret
train function. For example, for the nnet
method we need to give the extra argument linout=TRUE
to tell nnet
to create regression output instead of classification output. Similarly, for knn
we need preProcess=c('center', 'scale')
. It would be easy to extend our function to handle any number of additional arguments.
<- function(df){
caret_ptps_arglister
# Just using the first row, no matter how many rows in data passed in.
# Hacky but fine for now.
<- df[1,]
argdf
<- as.formula(argdf$model_formula)
model_formula <- get(argdf$data)
data <- argdf$method
method <- argdf$scenario
scenario <- eval(parse(text=argdf$pct_train_val))
pct_train_val <- argdf$partition.times
partition.times <- get(argdf$control)
control <- argdf$seed.partition
seed.partition <- argdf$seed.resample
seed.resample <- ifelse(length(argdf$extra_args)>0, argdf$extra_args, NULL)
extra_args
<- list(model_formula=model_formula, data=data,
arglist method=method, scenario=scenario,
pct_train_val=pct_train_val, partition.times=partition.times,
control=control, seed.partition=seed.partition,
seed.resample=seed.resample)
# For now assuming only one additional argument specified.
if (!is.null(extra_args)) {
<- strsplit(extra_args, "=")
arg <- arg[[1]][1]
argname <- eval(parse(text=arg[[1]][2]))
argval
<- argval
arglist[[argname]]
}
return(arglist)
}
Put it all together
Let’s use a simple loop to drive the process.
# Create an empty list to hold the results. Avoiding doing appends to a list.
<- nrow(scenarios_test_df)
maxlistsize <- vector("list", maxlistsize)
results_list
# Iterate over rows in scenario dataframe
for (i in 1:nrow(scenarios_test_df)) {
# Grab current row
<- scenarios_test_df[i,]
row # Create argument list corresponding to current scenario row
<- caret_ptps_arglister(row)
arglist # Partition, train, predict, summarize for current scenario row
<- do.call(caret_ptps, arglist)
results # We'll just save the RMSE and scenario values for this example
<- do.call(rbind,lapply(results[[1]], data.frame, stringsAsFactors=FALSE))
results_df # Save results dataframe to list
<- results_df
results_list[[i]] }
Now lets turn this list of dataframes into a single dataframe and make some summary plots. See https://stackoverflow.com/questions/2851327/convert-a-list-of-data-frames-into-one-data-frame for good discussion of various approaches to this common task. I’m going to use bind_rows
from the dplyr
package.
<- as.data.frame(bind_rows(results_list)) results_df
Now we can visualize the distribution of RMSE for the different models and different training set sizes.
ggplot(data=results_df) + geom_boxplot(aes(x=as.factor(ptrain), y=rmse_test), colour="tan2") + facet_wrap(~scenario)
Let’s replot without knn.
ggplot(data=results_df[results_df$scenario != 'occmean_pp_knn1',]) + geom_boxplot(aes(x=as.factor(ptrain), y=rmse_test), colour="tan2") + facet_wrap(~scenario)
Next steps
Now that I’ve got the basic idea working, I’ll build on this as I work on the research project and address the questions that originally prompted this series of posts. While working on this approach, I was talking to a former student of mine who is now working in the healthcare analytics field. As I was describing this, he said “that’s the kind of thing I did in my capstone project in the Master of Information and Data Science program at UC-Berkeley”. His team’s project resulted in a tool called autoCaret that strives to make it easier to get started with machine learning projects by wrapping and automating some of the steps. They focused on binary classification problems and also included a nice GUI in addition to the R package itself. And, it’s freely available on GitHub at https://github.com/gregce/autoCaret.
Reuse
Citation
@online{isken2016,
author = {Mark Isken},
title = {Comparing Predictive Models for Obstetrical Unit Occupancy
Using Caret - {Part} 3},
date = {2016-06-30},
langid = {en}
}