## Site Conventions

We assume a basic level of knowledge about the R statistical computing environment as well as awareness of common regression methods available through R. If you are new to R, numerous online tutorials, discussion groups, and blogs are available; see, for example, *R for Beginners* by Emmanuel Paradis.

### Code Items in Text

We use a variety of font decorations when referring to packages, functions, and input arguments in text passages. Package names are printed in bold, packageName. Functions defined by an R package are highlighted in blue, functionName(). Functions that we define through the website are highlighted in green, functionName(). Finally, input arguments for functions are in bold text-type, argumentName.

If a function exists in R but is not provided by the base (default) packages, we include the package name using R's double-colon convention. For example, lm(), which fits linear models, is defined by package stats and is referenced as stats::lm() in both text and code.

### Function Arguments

As mentioned previously, function argument names in text sections are bold text-type, argumentName. In addition, for code blocks, we always use the named argument input convention (name = object). Many functions allow for unnamed, positional inputs; however, we believe that named inputs provide more clarity.

For example, we will write the code to generate 100 random numbers from a normal distribution with mean 10 and standard deviation 2 as

```
stats::rnorm(n = 100, mean = 10, sd = 2)
  [1]  8.167082  8.265135 11.750101 12.823050  9.817039 12.971156 13.097068
 [8] 10.344266 15.192547  8.951312 10.192172  5.241290 10.286082 11.255429
[15] 11.721025 11.896537 10.158942 12.745745 10.044947  8.355021  8.532626
[22] 10.446595  7.957810 12.723588  9.979115 10.219141  8.244518  8.395776
[29] 10.250264  9.798035  8.170835 11.931790  8.900916  9.411897  8.484375
[36]  9.605591 12.386412 11.770323 12.505528  7.199994  8.534601  9.412743
[43] 14.017033  8.626459 10.049519  8.859831 11.460698  9.245987 11.783126
[50]  7.390956 11.622964  7.443781  8.515167 12.142947  9.646388  9.809504
[57] 11.954840 11.039984 12.012906 14.171457  8.724991  8.176391  9.993004
[64] 12.387336  9.283920  6.868521  7.508926 11.393521  9.833981  9.347148
[71]  9.549741  9.084681  9.240683 11.131544  7.019987  6.579202  7.602609
[78]  8.941636 12.273428 12.592533  8.403796  9.295288 15.314072  5.445649
[85]  8.530123 10.659008 11.296296  9.689286  6.439194  8.956625  8.219575
[92]  7.575232  7.842680 13.619029 10.693222  9.233547 10.463085 11.726034
[99]  9.177021 13.536567
```

though R would also understand

```
stats::rnorm(100, 10, 2)
  [1]  9.271698  9.192661 10.725390 14.811142  9.716182  8.752896 10.753793
 [8] 10.535138 12.410409 12.764792 10.232825  6.620746 10.030879 10.607405
[15]  6.595075 13.265181 11.810778  9.331417 10.994086 10.058339 10.090701
[22]  9.453874  6.953572 10.934352  7.375328 12.758642 12.978439 10.125703
[29]  8.011926  7.081048 12.185786  8.443431 12.499837  8.714348 11.413399
[36]  9.589082 13.171304 11.300317 12.272204  8.447132 11.390216 11.753112
[43]  7.644975  7.183854 10.797007  7.292401  8.593925 10.174816 11.748587
[50] 13.721499 10.902068 11.424707  8.163498 11.138128  9.065724  8.807123
[57] 11.632177  8.277435  9.414094 11.121433 13.071885 12.599860  7.926980
[64]  9.349403 13.106993  8.584079 11.247641 10.744198  8.330467  8.933934
[71] 11.793538  9.596610 15.001984  9.435056  6.873102  7.789240  8.378898
[78]  8.422043  9.515749  5.353567 11.533838  9.003131  9.745201 10.446663
[85] 10.383406  9.747556  9.739211  8.205746 11.155214  7.327233 11.173334
[92] 11.451706  9.147778  8.114784  7.471612 13.408158 10.665739 10.517753
[99] 10.384110  8.325173
```

to be equivalent. (Though the two calls to stats::rnorm() are equivalent, the results differ because we did not reset the seed between calls.)
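The equivalence can be demonstrated by resetting the seed before each call; a brief sketch (the seed value is arbitrary):

```r
set.seed(42L)
named <- stats::rnorm(n = 5L, mean = 10.0, sd = 2.0)

set.seed(42L)
positional <- stats::rnorm(5L, 10.0, 2.0)

# With the same seed, the named and positional calls give identical draws.
identical(named, positional)
# [1] TRUE
```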

### Numeric vs Integer

All numeric values include a decimal. All integers are indicated using R’s “L” notation; i.e., the integer value 1 is coded as 1L. This allows for more efficient and robust tests of equality.
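A brief base-R illustration of the distinction:

```r
# A bare 1.0 is stored as a double-precision numeric; 1L is an integer.
typeof(1.0)
# [1] "double"
typeof(1L)
# [1] "integer"

# identical() distinguishes the two storage modes,
identical(1.0, 1L)
# [1] FALSE

# whereas == compares values after coercion.
1.0 == 1L
# [1] TRUE
```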

### Curly Brackets vs Parentheses

We group expressions using curly brackets. For example, our convention is to write

```r
res <- {a + b} * {c + d}
```

in contrast to

```r
res <- (a + b) * (c + d)
```

Both are acceptable in R, but there are circumstances under which curly brackets can lead to improved speed.
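A small sketch confirming that the two grouping styles agree (the values are arbitrary):

```r
a <- 1.0; b <- 2.0; c <- 3.0; d <- 4.0

# The value of a { } block is its last expression, so curly brackets
# group arithmetic exactly as parentheses do.
resCurly <- {a + b} * {c + d}
resParen <- (a + b) * (c + d)

identical(resCurly, resParen)
# [1] TRUE
```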

### Help with Functions

If you are not familiar with a function used in our implementations (for illustration, assume the method is stats::lm()), you can type ?lm or ?stats::lm at the R command prompt to obtain the official documentation, or type “lm R CRAN” into your favorite search engine to locate blog posts, articles, or tutorials.

## Modeling Objects

The methods discussed on this website rely on regressions. We make use of R’s modelObj package, which facilitates the implementation of statistical methods using a modeling object framework. This framework separates the details of a regression step from the implementation of a new method. We briefly introduce the features of this package that we use throughout the book.

For some of the early chapters, the modelObj framework may seem a bit heavy-handed. However, in later chapters we discuss methods for dynamic treatment regimes as implemented in package DynTxRegime, which is built on the modelObj framework. It is our hope that introducing the framework early in our discussion will facilitate its use in those more complex settings.

### Introduction

A modeling object can be thought of as a class, such as those encountered in high-level languages like C++ and Java. In the traditional language of classes, a modeling object has ‘state variables’ that include the postulated model, the method to be used to estimate parameters, and the method to be used to make predictions. The object also has behaviors, such as ‘obtain parameter estimates’ and ‘make predictions.’

This framework essentially separates the implementation of a statistical method that requires a regression step from the details of that regression step. Specifically, a developer does not need to commit to a particular regression method (and its inputs!) such as stats::lm() or stats::nls(), but simply passes a modeling object and the appropriate data to a generic ‘fit’ method; the package takes care of the rest. Through this framework, developers avoid artificially limiting the regression methods their implementations support.

Users of methods built on the modeling object framework provide the details of a regression step as a single, compact input variable. The defined object contains all control parameters for the regression and prediction steps.

This modeling object framework has been implemented in R package modelObj. Users of methods developed using modelObj define modeling objects through function modelObj::buildModelObj().

### Function buildModelObj()

```r
modelObj::buildModelObj(model,
                        solver.method = NULL,
                        solver.args = NULL,
                        predict.method = NULL,
                        predict.args = NULL)
```

Input model is a standard R formula object specifying the model to be used for the regression step, e.g., y ∼ x. Note that the left-hand side of the formula, if given, is ignored.
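Because the left-hand side is ignored, a one-sided formula suffices; a quick base-R sketch:

```r
# A one-sided formula omits the response; modelObj supplies the
# response separately through its fit method.
f <- ~ x1 + x2

class(f)
# [1] "formula"

# A one-sided formula has two parts: the ~ operator and the right-hand side.
length(f)
# [1] 2
```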

Inputs solver.method and solver.args specify the method used to obtain parameter estimates. solver.method is a character string naming the R function, e.g., “lm”, “nls”, or “glm”. solver.args is an optional named list used to modify default values of arguments passed to the function specified in solver.method. See the examples below for further details.

Similarly, inputs predict.method and predict.args specify the method used to obtain predictions. predict.method is a character string naming the R function, e.g., “predict.lm”, “predict”, or “predict.glm”. predict.args is an optional named list used to modify default values of arguments passed to the function specified in predict.method.

### Examples

#### Continuous Outcome Variable

To illustrate, assume that our data set, data, contains covariates $$x_1$$ and $$x_2$$ and a continuous outcome variable, $$y$$. We postulate a model $$y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2$$ and want to obtain parameter estimates using ordinary least squares through R’s stats::lm() function. Typically, this is coded in R as

```r
fit <- stats::lm(formula = y ~ x1 + x2, data = data)
```

Predictions from that analysis would be obtained as

```r
pred <- stats::predict.lm(object = fit)
```

The modeling object defining these regression and prediction steps is specified as

```r
mo <- modelObj::buildModelObj(model = ~ x1 + x2,
                              solver.method = "lm",
                              predict.method = "predict.lm")
```

Notice that we did not explicitly include the package name stats when specifying solver.method or predict.method. Because we specify the function names using character strings, we cannot include the package name. With this choice of input style, the stats package must be loaded into your R environment prior to calling modelObj::buildModelObj(). Users can instead provide the function itself (solver.method = stats::lm); however, this passes the entire function as input and results in cumbersome screen prints. Our choice here is purely for aesthetics.

#### Binary Outcome Variable

For a binary outcome, $$y$$, we postulate the logistic regression model $$\text{logit}\{\Pr(y = 1)\} \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2$$. Parameter estimates are obtained using maximum likelihood through R’s stats::glm() function.

```r
fit <- stats::glm(formula = y ~ x1 + x2,
                  data = data,
                  family = binomial)
```

Notice that an additional input is required, namely family = binomial. This input specifies the distribution family of the model and the link function; the default for stats::glm() is family = gaussian.
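Each family object also fixes a default link function; a quick check in base R:

```r
# binomial() defaults to the logit link, yielding logistic regression;
# gaussian() defaults to the identity link, yielding linear regression.
stats::binomial()$link
# [1] "logit"

stats::gaussian()$link
# [1] "identity"
```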

Predictions from the above analysis would be obtained as

```r
pred <- stats::predict.glm(object = fit, type = "response")
```

Again, there is a change to the default inputs, i.e., type = “response”. The default input for stats::predict.glm() is type = “link”, which returns predictions on the scale of the linear predictor, not that of the outcome variable.
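The two scales are connected through the inverse link function: for a logistic model, applying the inverse logit, stats::plogis(), to predictions on the link scale recovers the response scale. A small self-contained sketch with simulated data (the data set and variable names are illustrative):

```r
set.seed(123L)

# Simulate a small data set with a binary outcome.
n <- 50L
df <- data.frame(x1 = stats::rnorm(n = n), x2 = stats::rnorm(n = n))
df$y <- stats::rbinom(n = n, size = 1L, prob = stats::plogis(q = df$x1 - df$x2))

fit <- stats::glm(formula = y ~ x1 + x2, data = df, family = binomial)

onLink <- stats::predict.glm(object = fit, type = "link")
onResponse <- stats::predict.glm(object = fit, type = "response")

# The inverse logit maps the linear predictor to a probability.
all.equal(stats::plogis(q = onLink), onResponse)
# [1] TRUE
```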

The modeling object defining these regression and prediction steps is specified as

```r
mo <- modelObj::buildModelObj(model = ~ x1 + x2,
                              solver.method = "glm",
                              solver.args = list(family = "binomial"),
                              predict.method = "predict.glm",
                              predict.args = list(type = "response"))
```

### Limitations of modelObj

There are two important limitations to this framework that must be kept in mind whenever using packages or functions that are built on the modelObj framework.

1. There is no built-in model checking. It is the responsibility of the user to define models responsibly.

2. Care must be taken in specifying the scale of the predictions, as in the binary example above.

### Methods for Modeling Objects

Package modelObj has several methods available. In this chapter, we will make extensive use of only three of them: modelObj::fit(), modelObj::predict(), and modelObj::fitObject().

#### modelObj::fit()

```r
modelObj::fit(object, data, response, ...)
```

Function modelObj::fit() performs the regression step. It takes as input object, a modeling object returned by a call to modelObj::buildModelObj(); data, a data.frame containing all relevant covariates; and response, the outcome of interest. This function returns an object of class “modelObjFit”, which includes the regression results.

#### modelObj::predict()

```r
modelObj::predict(object, ...)
```

Function modelObj::predict() obtains predictions. This function takes as input object, an object returned by modelObj::fit(), and an optional input newdata containing a data.frame for which predictions are to be made. The value returned is generally a vector or matrix of predictions; the exact structure is determined by the prediction method specified in the modeling object.

#### modelObj::fitObject()

```r
modelObj::fitObject(object, ...)
```

Function modelObj::fitObject() is a utility function that strips away the modeling object framework and returns the object produced by the underlying regression method. This allows users to complete post-analysis steps in the usual way. For example, calls to R’s stats::lm() return objects of class “lm”, to which methods such as stats::residuals() and plot() can be applied.
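Putting the three methods together, the following is a minimal end-to-end sketch for the continuous-outcome example above (it assumes package modelObj is installed; the simulated data set and variable names are illustrative):

```r
set.seed(456L)

# Simulated data with two covariates and a continuous outcome.
n <- 100L
df <- data.frame(x1 = stats::rnorm(n = n), x2 = stats::rnorm(n = n))
df$y <- 1.0 + 2.0 * df$x1 - df$x2 + stats::rnorm(n = n)

# Define the regression and prediction steps as a modeling object.
mo <- modelObj::buildModelObj(model = ~ x1 + x2,
                              solver.method = "lm",
                              predict.method = "predict.lm")

# Regression step: returns an object of class "modelObjFit".
fitResult <- modelObj::fit(object = mo, data = df, response = df$y)

# Prediction step for the original data.
pred <- modelObj::predict(object = fitResult, newdata = df)

# Recover the underlying "lm" object for standard post-analysis steps.
lmFit <- modelObj::fitObject(object = fitResult)
```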