Building Packages in R – Part 1: The Skeleton

Last week we took some time to set up the prerequisites of building R packages, so if you haven’t done so already, feel free to take a look at Part 0: Setting Up R. This week we will look at the basic structure of R-Packages – its skeleton if you will.

The Building Blocks: Functions

As you are probably aware, packages are nothing more than collections of functions. Most of the time they are a collection of interconnected functions which stem from a certain area of statistics. In this post we will not go into the nitty-gritty details of how to write functions or what you should and shouldn’t do. Instead, this is just a quick refresher. The following code was used by Frederick Ho in his first post on longitudinal analysis with multilevel growth models, I’ve just wrapped it in a function to make it reusable with as little effort as possible.

trajectories <- function(model, occasions=1:3, timevar='time', confidence=.95) {

  dat.new <- data.frame(occasions)
  names(dat.new)  <- timevar

  dat.new$measure <- predict(model, dat.new, re.form=NA)

  zcrit <- qnorm(c((1-confidence)/2,1-(1-confidence)/2))
  m.mat <- model.matrix(terms(model), dat.new)

  dat.new$var <- diag(m.mat %*% vcov(model) %*% t(m.mat)) + unlist(VarCorr(model))
  dat.new$pvar <- dat.new$var + sigma(m)^2
  dat.new$ci.lb <- with(dat.new, measure + zcrit[1] * sqrt(var))
  dat.new$ci.ub <- with(dat.new, measure + zcrit[2] * sqrt(var))
  dat.new$pi.lb <- with(dat.new, measure + zcrit[1] * sqrt(pvar))
  dat.new$pi.ub <- with(dat.new, measure + zcrit[2] * sqrt(pvar))
  
  return(dat.new)
}

If you’re not familiar with Frederick’s post, go check it out. For a quick recap it is sufficient to know, that this function calculates the confidence as well as the prediction interval around the value predicted in a growth model. It takes four arguments model, occasions, timevar, and confidence.

The first, as the name might suggest, expects a model – specifically an object of class lmerMod. Because this function doesn’t have an explicit fail-safe providing a model from some other package for mixed-effects modeling (e.g. nlme) will likely throw a cryptic error. But, as is, it should work with lme4. The second argument, occasions, expects a vector of occasions. Using the seq-function you ask for any length or precision in the depicted time. By providing it with a default (in this case a sequence from 1 to 3) the argument does not need to be specified every time. The same is true for the timevar-argument which wants the name of variable in the model used to code time. Finally, the function allows you to provide a confidence level of your choosing with the omnipresent .95 being the default.

Adding A Dataset

Most packages come with data sets for testing out the functions. These can usually be called via data and will most often be what is used in the examples provided in the help-files. Again, we will use Frederick’s example:

library(MASS)

set.seed(140915)
dat.tx.a <- mvrnorm(n=250, mu=c(30, 20, 28), 
                    Sigma=matrix(c(25.0, 17.5, 12.3, 
                                   17.5, 25.0, 17.5, 
                                   12.3, 17.5, 25.0), nrow=3, byrow=TRUE))

dat.tx.b <- mvrnorm(n=250, mu=c(30, 20, 22), 
                    Sigma=matrix(c(25.0, 17.5, 12.3, 
                                   17.5, 25.0, 17.5, 
                                   12.3, 17.5, 25.0), nrow=3, byrow=TRUE))

fred <- data.frame(rbind(dat.tx.a, dat.tx.b))
names(fred) <- c('measure.1', 'measure.2', 'measure.3')

fred <- data.frame(subject.id = factor(1:500), tx = rep(c('A', 'B'), each = 250), fred)

rm(dat.tx.a, dat.tx.b)

fred <- reshape(fred, varying = c('measure.1', 'measure.2', 'measure.3'), 
               idvar = 'subject.id', direction = 'long')

The only difference to what you will find between this code and the one that Frederick used in his three past posts on longitudinal data analysis is that I used the set.seed function to make all values be exactly the same every time I run this. Also, to avoid any confusion with other data-related functions and objects I named the dataset fred instead of dat.

Building The Skeleton

That’s all we need for the most minimalist package. By executing the code I presented here your global environment should only contain two things: the function trajectories and the dataframe fred. Let’s check:

ls()

"fred"    "trajectories"

If there’s more in your environment, get rid of it with:

rm(list=ls()[!ls()%in%c('fred','trajectories')])

One more thing: make sure that you have set the working directory to where you want to create your package. Now, let’s build the package. The first function you can use to create the structure you need for an R-Package is package.skeleton. This function will usually require you to provide one – at most two – arguments. The first (name) is the name of the package. I will name this one “dsp” – as in DataSciencePlus. With the second argument you can provide a list of objects in your global environment that you want in the package. Because we just ensured that there is nothing but the necessary components in the global environment we can neglect this argument. Doing so will pour your entire global environment into the package.

package.skeleton('dsp')

Creating directories ...
Creating DESCRIPTION ...
Creating NAMESPACE ...
Creating Read-and-delete-me ...
Saving functions and data ...
Making help files ...
Done.
Further steps are described in './dsp/Read-and-delete-me'.

This should created a directory called “dsp” in my working directory. To investigate what this new directory contains, we can use (or your basic file browser, whichever you prefer)

dir('./dsp')

"data"   "DESCRIPTION"    "man"   "NAMESPACE"         
"R"      "Read-and-delete-me"

The three elements data, man, and R are directories; the other three are text files.

dir('./dsp',recursive=TRUE)

"data/fred.rda"       "DESCRIPTION"         "man/dsp-package.Rd"  "man/fred.Rd"        
"man/trajectories.Rd" "NAMESPACE"           "R/dsp-internal.R"    "Read-and-delete-me" 
"R/trajectories.R"

As you can see the data directory contains the dataset we created, called fred.rda. In the future, all example datasets we want to add to the package will also be stored in this directory. The man directory contains three .Rd files. The ending stands for R documentation, making it quite probable that man stands for manual. We will take a closer look at these files in an upcoming post, but rest assured that these are what is used to generate the HTML you see when you call the help-function on any R-function. The final directory – R – contains the .R-files. One is internal – something we will also look at at a later date – and one is named after and contains the function we created.

The DESCRIPTION

Before looking at the DESCRIPTION file, we should take a look at the aptly named Read-and-delete-me. This file is automatically generated and should contain the following text:

* Edit the help file skeletons in 'man', possibly combining help files for
  multiple functions.
* Edit the exports in 'NAMESPACE', and add necessary imports.
* Put any C/C++/Fortran code in 'src'.
* If you have compiled code, add a useDynLib() directive to 'NAMESPACE'.
* Run R CMD build to build the package tarball.
* Run R CMD check to check the package tarball.

Read "Writing R Extensions" for more information.

All very good advice – some of which we will even follow. On to the DESCRIPTION-file. This should look something like this:

Package: dsp
Type: Package
Title: What the package does (short line)
Version: 1.0
Date: 2015-09-14
Author: Who wrote it
Maintainer: Who to complain to 
Description: More about what it does (maybe more than one line)
License: What license is it under?

The first two lines are acceptable as is, but we should change the title of the package to something a bit more descriptive.

Title: DataSciencePlus Example

Next up is the version. Version numbering can be an immensely tricky subject and there are quite a few guidelines out there. There is just one rule that is given by R: all values must be numeric. The rest is up to you. The one guideline I (try to) follow is called Semantic Versioning and is very close to the one proposed by Hadley Wickham in his book on writing R-Packages. The idea is that the version consists of three components: major.minor.patch. The idea is that major versions may break compatibility with previous versions, while minor versions do not. Patches are bug fixes which introduce no changes besides repairing what was not working correctly in the release. Your version number should start out at 0.1.0 and the minor version should increase until the initial public release. At that point it should change to 1.0.0 and increase in the major.minor.patch format.

Version: 0.1.0

The next three lines can be changed at your leisure – you should always provide a functioning Email-Address to take responsibility for your mistakes. The package description should clarify what the package can be used for or what is. In this case let’s go with:

Description: An example for creating packages from DataSciencePlus

The final line – the license – can also be topic of some debate. You should concern yourself with it before making the package accessible to the public, but it is not something we will concern ourselves with right now.

What we have now is the bare-bone structure of the package. Next time we will take a look at writing the documentation for a package and some other fun stuff.

Package Building