After working with R for some time, you realize that it can be difficult to deal with moderately large and large data sets. The issue is that everything in R is an object, and these objects are loaded into RAM. So if you don't have enough RAM, you need some other way to handle these data sets.

Today, in the BIG DATA era, it is not difficult to find yourself in such a situation. I had to deal with this problem myself when working with Brazilian public data sets such as RAIS and the Brazilian Census. In this tutorial I will present an integration of R and MonetDB to solve this problem. The advantages of MonetDB over other well-known solutions will be clear by the end.

Possible solutions

Basically, in situations where the data set is "manageable" on an ordinary PC but larger than RAM, the ideal is to use a package like ff or bigmemory, or even a DBMS. I would say that when working with data sets that far exceed the size of RAM, it is always best to work with a database management system such as PostgreSQL, MySQL or MonetDB. This is especially true with public microdata, which in general you only need to load once and from then on just query. In addition to allowing consistent data storage, you can also use the dplyr package with a database backend, so you can work with the database tables using pretty much the same syntax you would use with a data.frame. It is a solution that is fast, does not suffer from R's memory limitations, and can be used for a long time.
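For instance, here is a minimal sketch of the first approach using the ff package, which keeps the data on disk and reads it in chunks; the file name big_file.csv is just a placeholder, not part of this tutorial's data:

## Reading a larger-than-RAM csv with ff (sketch; 'big_file.csv' is hypothetical)
library(ff)

## read.csv.ffdf loads the file in chunks and stores it on disk,
## returning an ffdf object that can be used much like a data.frame
dados <- read.csv.ffdf(file = 'big_file.csv', header = TRUE)
dim(dados)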

Which one to use?

I always recommend PostgreSQL and MySQL. They are open-source projects, widely used and well documented, and they have the features expected of a good DBMS. BUT THERE IS A PROBLEM: you need to install and configure them. It may not seem like a huge problem, but the fact is that this is a serious barrier for many R users. Also, this can be overkill for users who do not need everything such systems offer, since it demands a lot of extra work to solve a simple problem: manipulating files larger than RAM. But there are alternatives! Among the available DBMSs, two are easy to use with R and require no installation: SQLite and MonetDB.

The first can be easily used with R through the RSQLite package and the second through the MonetDBLite package. My favorite, and the one I am going to use here as an example, is MonetDB, because it stores data by columns. Write operations are therefore more expensive, but queries are much cheaper than in a row-oriented DBMS. If you work with data that does not need to change often, columnar database management systems are far superior. Specifically, in the case of microdata that you load once and only read from then on, I believe MonetDB is the best option available today.

Application

To reproduce my solution you will need MonetDBLite installed on your system. So:

install.packages('MonetDBLite', dependencies = TRUE)

From there, the next step is to load the data into the database. Fortunately, MonetDBLite allows you to do this automatically using the monetdb.read.csv() function. Assuming you have the ENEM data set, in my case microdados_enem2014.csv, in the same working directory where you are going to run the script, run the following commands:

## Loading packages
library(MonetDBLite)
library(DBI)

## Setting the directory where the database will be stored
dbdir <- 'database/'

## Creating a connection to the database
con <- dbConnect( MonetDBLite::MonetDBLite() , dbdir )

## Dumping the csv file to the database
monetdb.read.csv(conn = con, files = 'microdados_enem2014.csv', tablename = 'enem2014', header = TRUE, na.strings = '', delim = ',')

## Listing tables
dbListTables(con)

## Counting rows in the table
dbGetQuery(con, 'SELECT count(*) FROM enem2014')

## Querying the first 100 rows
teste <- dbGetQuery(con, "SELECT * FROM enem2014 LIMIT 100")

The load did not take more than a minute on my machine, a Core i7 with 16GB of RAM, but on a weaker machine it should take a little longer. Another fundamental point is that the file must be in UTF-8 encoding for loading, and the ENEM CSVs come in ISO-8859-1. You can easily change the encoding with the iconv command in the Linux terminal:

iconv -f ISO-8859-1 -t UTF-8 MICRODADOS_ENEM_2014.csv > microdados_enem2014.csv

but on Windows you must follow the iconv installation procedure shown here. In fact, there are several ways to change a file's encoding on Windows, and this is only a suggestion.
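If you prefer to stay inside R, here is a minimal sketch of the same conversion using base R connections; the file names are the ones used above, and for very large files the command-line iconv will likely be faster:

## Re-encoding the ENEM csv from ISO-8859-1 to UTF-8 within R (a sketch)
con_in  <- file('MICRODADOS_ENEM_2014.csv', open = 'r', encoding = 'ISO-8859-1')
con_out <- file('microdados_enem2014.csv', open = 'w', encoding = 'UTF-8')

## Read and rewrite the file in chunks of 100,000 lines to keep memory usage low
while (length(lines <- readLines(con_in, n = 100000)) > 0) {
  writeLines(lines, con_out)
}
close(con_in)
close(con_out)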

Another point to note is that, since the table and the database have already been created, you can run queries directly in SQL, as if you were at the database console. If you know SQL, your problem is solved: just generate the data you need with a query and then process the result as a data.frame in R.
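For example, the same aggregation computed with dplyr in the next section could be written directly in SQL; this is a sketch reusing the connection and the column names from this tutorial:

## Average math grade by state and administrative dependency, in pure SQL
medias <- dbGetQuery(con, "
  SELECT COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC, AVG(NOTA_MT) AS media_mt
  FROM enem2014
  GROUP BY COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC
")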

Using dplyr

Of course, the best option for using databases with R is to use dplyr directly, which allows the R user to work with the database without writing any SQL. The coolest part of this strategy is that dplyr converts the R commands into SQL queries, and even the intermediate results of queries stay inside the database, so there are no performance problems related to R's limitations with RAM. As an example, we will compute the students' average grade by state and administrative dependency using dplyr.

## Loading packages
library(dplyr)

## Connection to the database
my_db <- MonetDBLite::src_monetdb(embedded=dbdir)
my_tbl <- tbl(my_db, "enem2014")

## Getting the average by state and administrative dependency
consulta <- my_tbl %>% group_by(COD_UF_ESC, ID_DEPENDENCIA_ADM_ESC) %>% summarise(media_mt = mean(NOTA_MT))
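
## To confirm that dplyr really translates the pipeline above into SQL, you can
## inspect the generated statement while 'consulta' is still a lazy table,
## i.e. before calling collect() (show_query() is part of dplyr):
show_query(consulta)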

## Collecting the data
consulta <- collect(consulta)

Conclusion

This is an excellent solution for working with a large data set: you filter out smaller chunks and then analyze them within R. It is also an easier way to keep large data sets stored for future queries across various projects. HOWEVER, it is not a solution when you need to fit models directly in the DBMS. If that is the case, there are other, better solutions.