R Programming – Pitfalls to avoid (Part 1)

As we continue to program in R, all of us inevitably run into errors and bugs in our code. Not all programming errors are created equal. Many of them are pretty straightforward to deal with: they come with clear, unambiguous error messages that a little googling (or reading the help documentation) can help us resolve.

On the other hand, some errors or bugs can really test our patience and resolve. What makes an error daunting to deal with is usually one or more of these factors: (1) the error is conceptual in nature, rather than stemming from superficial causes such as missing or misspelled arguments; (2) the code usually works fine and fails only under specific circumstances; (3) rather than failing immediately, the code produces an unexpected result, which triggers an error much later in the program. Tackling such errors can be a very frustrating experience.

This post, targeted at beginner to intermediate R users, highlights a few such pitfalls that you will most likely encounter (if you haven't already) as you continue to program in R. Being aware of these traps helps us prepare for them, potentially saving countless hours that we might otherwise spend trying to resolve them.

Factors in R

At first sight, factor variables in R seem harmless enough. As you may know, factors are essentially categorical variables that take on a limited number of unique values (the factor levels).
What makes factors potentially dangerous is the way they are stored: a factor is a vector of integer codes, with the level labels kept in a separate attribute. This can produce some fairly bewildering and unintended results if we are not careful. In one of the books I read on R programming (The R Inferno by Patrick Burns), factors are quite aptly described as “tricky little devils”!

Without further ado, let’s look at a simple example, which demonstrates this tricky nature.
(All examples in this tutorial use the built-in iris dataset, which I assume you are familiar with. If not, you can read its help documentation with help(iris).)

Let's assume that we want to change the name of one of the species in the dataset. For example, we would like to shorten "versicolor" to "versi".
The problem seems pretty straightforward, and we decide to use the ifelse function to implement this logic with the following code:

iris$Species <- ifelse(iris$Species == "versicolor","versi",iris$Species)

The code runs without any error, as we would have expected. However, if you view the dataset, you quickly realize that the results are not what you wanted: while versicolor changes to versi, the other two species are replaced by their integer codes, so the column now contains "1", "versi" and "3" as character values.

And this is what makes it so dangerous: you may not even be aware that something unexpected has happened until much later, when some other piece of code fails in an unexpected manner. Tracing that error back to its origin can then be really challenging.
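
To see the effect concretely, here is a minimal sketch that repeats the assignment on a copy of iris and tabulates the result. The name iris_copy is my own and is used only so that the built-in dataset is left untouched.

```r
# Work on a copy so the built-in iris dataset is not modified
iris_copy <- iris

# Same logic as above: the "no" branch is a factor, so its integer
# codes (not its labels) end up in the result
iris_copy$Species <- ifelse(iris_copy$Species == "versicolor",
                            "versi",
                            iris_copy$Species)

class(iris_copy$Species)   # "character", no longer a factor
table(iris_copy$Species)   # 50 rows each of "1", "3" and "versi"
```

The labels setosa and virginica are gone; only their positions in the original level order (1 and 3) survive.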

How to deal with it

There are multiple ways in which this can be dealt with.

We could simply coerce the factor to character using as.character. The previous example would then be written as:

iris$Species <- ifelse(iris$Species == "versicolor","versi",as.character(iris$Species))

You may also want to consider setting the stringsAsFactors argument to FALSE when reading a dataset (for example with read.csv). This ensures that character values are read as character vectors rather than being automatically converted to factors. Since R 4.0.0 this is the default behavior. You can then manipulate them as character vectors and eventually convert them back to factors, if required.
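
As a rough sketch of the "manipulate as character, then convert back" idea, the snippet below does the rename on a copy of iris and rebuilds the factor at the end. It also shows one further option that is not discussed above, namely renaming the level itself through levels(). The names iris_copy, iris_alt and species_chr are mine, chosen for illustration.

```r
# Option A: convert to character, edit, then convert back to a factor
iris_copy <- iris
species_chr <- as.character(iris_copy$Species)
iris_copy$Species <- factor(ifelse(species_chr == "versicolor",
                                   "versi",
                                   species_chr))
levels(iris_copy$Species)   # "setosa" "versi" "virginica"

# Option B (an alternative, not from the text above): rename the level directly
iris_alt <- iris
levels(iris_alt$Species)[levels(iris_alt$Species) == "versicolor"] <- "versi"
levels(iris_alt$Species)    # "setosa" "versi" "virginica"
```

Option B never leaves the factor representation, so there is no step at which integer codes can leak into the data.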

Simplifying vs Preserving Data Types

I recollect first reading about this concept in Hadley Wickham's Advanced R. At that time, I quickly skimmed through it, naively assuming it to be of more theoretical than practical interest. However, over time, as I encountered multiple errors that stemmed from an incomplete understanding of this concept, I became increasingly sensitive to this aspect of R programming.

To illustrate this concept, let’s devise a simple example.

Assume that we would like to define a simple function that takes 2 arguments: a data frame and a numeric vector of column indices. The function returns the mean of each of the selected columns.

We write the function as follows:

mean_func <- function(df, var_list) {
  # Extract the selected columns using the supplied vector of indices
  df_selected <- df[, var_list]
  # Compute the mean of each selected column
  sapply(df_selected, mean)
}

Let's try out this function with the iris dataset.
To compute the mean of the first 2 columns of iris, we invoke the function as follows:

mean_func(iris,c(1:2))

The results are as expected.

However, you are in for a surprise if you compute the mean for just one column, say the 1st column (Sepal.Length):

mean_func(iris,1)
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1
 [25] 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6
 [49] 5.3 5.0 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7
 [97] 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0
[121] 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

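Oops! What just happened?

It turns out that when we subset a data frame using matrix notation (df[, j]), the result is simplified to the lowest possible dimensionality by default. With a single column selected, df_selected is therefore a plain numeric vector, and sapply applies mean to each of its 150 elements individually, which simply returns the values themselves. Compare the output of these two calls: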

str(iris[,1:2]) 
str(iris[,1])

The first object is a data frame, but the second, given that it has just one column, gets simplified to a vector.
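
The same comparison can be made programmatically. This is a small sketch using only base R; the variable names two_cols and one_col are mine.

```r
two_cols <- iris[, 1:2]   # two columns selected: stays a data frame
one_col  <- iris[, 1]     # one column selected: simplified to a vector

class(two_cols)   # "data.frame"
class(one_col)    # "numeric"

# sapply over a data frame iterates over columns,
# but over a vector it iterates over individual elements
length(sapply(two_cols, mean))   # 2
length(sapply(one_col, mean))    # 150
```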

How to deal with it

1. Specify the drop argument
You can set the drop argument to FALSE to ensure that the output is not simplified. (Spell out FALSE rather than F, since F is an ordinary variable that can be reassigned.)

df_selected <- df[, var_list, drop = FALSE]

2. Use list notation
As you may know, a data frame is also a special type of list, with each column being a list element. For selecting multiple columns, these two expressions are equivalent:

iris[,c(1:2)]
iris[c(1:2)]

Both return a data frame with 2 columns. However, list notation with single brackets never simplifies the output, so iris[1] returns a data frame with 1 column rather than a vector.
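
Putting the two fixes into the earlier function gives something like the sketch below. The function names mean_func_drop and mean_func_list are hypothetical, introduced only to keep the two variants apart.

```r
# Variant 1: matrix notation with drop = FALSE
mean_func_drop <- function(df, var_list) {
  df_selected <- df[, var_list, drop = FALSE]  # always a data frame
  sapply(df_selected, mean)
}

# Variant 2: list notation, which never simplifies
mean_func_list <- function(df, var_list) {
  df_selected <- df[var_list]                  # always a data frame
  sapply(df_selected, mean)
}

mean_func_drop(iris, 1)     # a single named value: the mean of Sepal.Length
mean_func_list(iris, 1)     # same result
mean_func_list(iris, 1:2)   # still works for several columns
```

Both variants now return one mean per selected column, whether one column or several are requested.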

Whether a data type gets simplified or preserved matters in many other similar contexts, so it pays to be especially aware of this fundamental concept while programming in R.

OOP

While R supports object-oriented programming (OOP), you can do a fair bit of work in R without knowing a whole lot about OOP principles. However, at least a rudimentary understanding of some key OOP concepts, and of the way they are implemented in R, can help avoid a lot of confusion and frustration.

Two concepts in particular stand out in my mind:

1. Polymorphism
As an illustration of the confusion that this can cause, consider trying out different model fitting functions in R. Suppose you have built a logistic regression model and, to predict the outcome on a new dataset, you run the following code:

predict(glm_model, newdata = test_df, type="response")

This works as expected and returns the predicted probability for each observation. But if you try the same call on a different model object (say, a classification tree built with the rpart package), it will fail, because "response" is not a valid type for that model. You would have to specify type = "prob" to get the predicted probabilities. Even the return type varies: for a glm you get a numeric vector, whereas for an rpart classification tree you get a matrix with one column per class.

Underlying this apparent confusion is the concept of polymorphism: the behavior of a function is tied to the class of the object supplied as its argument. predict is a generic function, and each model class can supply its own method for it. If you read the help documentation of the rpart package (using help(package = "rpart")), for instance, you will find predict.rpart listed. This is the page to refer to (?predict.rpart) if you want to understand how predict is defined for an rpart object.

2. S4 classes
It turns out that there are multiple object-oriented systems in R, with the most commonly used one (in the base and stats packages, for example) being the S3 class system. S3 is a fairly informal class system, and that is why most of us can get by without really worrying about OOP paradigms.

R, however, also has a more formally defined class system: S4 classes. Most packages use the S3 class system, so when you first encounter a package that implements S4, it can be a bewildering experience. (If I recollect correctly, my first exposure to S4 was while using the ROCR package, which implements various performance measures for classification models.)

Fortunately, there isn't a whole lot that you need to know about S4 objects for most of your tasks. Be aware of these minimal facts: (1) where you would use the names function to investigate the components of an S3 object (if they are named), for an S4 object you use slotNames; (2) to extract a specific component from an S4 object, you use the @ operator rather than the $ operator that you would use for S3.
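
Both ideas can be tried without installing any package. The sketch below is a toy illustration under my own assumptions: the generic describe, the classes circle and square, and the S4 class Person are all hypothetical names made up for this example, not part of any package mentioned above.

```r
library(methods)  # usually attached already; loaded explicitly to be safe

# --- S3: a generic dispatches on the class of its first argument ---
describe <- function(x, ...) UseMethod("describe")
describe.circle <- function(x, ...) paste("circle with radius", x$r)
describe.square <- function(x, ...) paste("square with side", x$side)

shape1 <- structure(list(r = 2), class = "circle")
shape2 <- structure(list(side = 3), class = "square")

describe(shape1)   # calls describe.circle
describe(shape2)   # calls describe.square
names(shape1)      # components of a list-based S3 object: "r"
shape1$r           # extracted with $

# --- S4: formally declared slots, inspected with slotNames, accessed with @ ---
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

isVirtualClass("Person")   # FALSE: an ordinary class that can be instantiated
slotNames(p)               # "name" "age"
p@name                     # extracted with @
```

In the same spirit, methods(predict) lists the predict methods available in your session, which is a quick way to find out which help page (for example ?predict.glm) describes the behavior for a given model class.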

Hope you find this useful. Please feel free to comment below if you have any questions.

Thanks!
