Integrating R with Apache Hadoop

Integrating R to work on Hadoop is to address the requirement to scale R program to work with petabyte scale data. The primary goal of this post is to elaborate different techniques for integrating R with Hadoop.

Approach 1: Using R and Streaming APIs in Hadoop

In order to integrate an R function with Hadoop and see it running in a MapReduce mode, Hadoop supports Streaming APIs for R. These Streaming APIs primary help running any script that can access and operate with standard I/O in a map- reduce mode. So, in case of R, there wouldn’t be any explicit client side integration done with R. Following is an example for R and streaming:

$ ${HADOOP_HOME}/bin/Hadoop jar
${HADOOP_HOME}/contrib/streaming/*.jar \
-inputformat
org.apache.hadoop.mapred.TextInputFormat \
-input input_data.txt \
-output output \
-mapper /home/tst/src/map.R \
-reducer /home/tst/src/reduce.R \
-file /home/tst/src/map.R \
-file /home/tst/src/reduce.R

Approach 2: Using Rhipe package of R

There is a package in R called “Rhipe” that allows running a MapReduce job within R. To use this way of implementing R on Hadoop there are some pre-requisites. R needs to be installed on each Data node in the Hadoop Cluster, Protocol Buffers will be installed and available on each Data node (for more on Protocol Buffer) and Rhipe should be available on each data node.

Following is a sample format for using Rhipe library in R to implement MapReduce:

library(Rhipe)
rhinit(TRUE, TRUE);
map<-expression ( {lapply (map.values, function(mapper)…)})
reduce<-expression(
pre = {…},
reduce = {…},
post = {…},
)
x <- rhmr(map=map, reduce=reduce,
 ifolder=inputPath,
 ofolder=outputPath,
 inout=c('text', 'text'),
 jobname='test name'))
rhex(x)

Approach 3: Using RHADOOP

RHadoop, very similar to RHipe, facilitates running R functions in a MapReduce mode. It is an open source library built by Revolution Analytics. Following are some packages the are a part of the RHadoop library. plyrmr apackage that provides functions for common data manipulation requirements for large datasets running on Hadoop. rmr a package that has collection of functions that integrate R and Hadoop. rdfs a package with functions that help interface R and HDFS. rhbase a package with functions that help interface R and HBase

Following is an example that uses rmr package and demonstrates the steps to integrate R and Hadoop using the functions from that package.

library(rmr)
maplogic<-function(k,v) { …}
reducelogic<-function(k,vv) { …}
mapreduce( input ="data.txt",
output="output",
textinputformat =rawtextinputformat,
map = maplogic,
reduce=reducelogic
)

Summary of R / Hadoop integration approaches

In summary, all the above three approaches yield results and facilitate integrating R and Hadoop and help scaling R to operate on large scale data that will work on HDFS and each of these approaches has pros and cons.

Key Conclusions:

Hadoop Streaming API is the simplest of all the approaches as it is easy to install / set-up. Both Rhipe and RHadoop require some effort to set up R and related packages on the Hadoop cluster. In terms of implementation approach Streaming API is more of a command line map and reduce function while both Rhipe and RHadoop allow developers to define and call custom MapReduce function within R functions. Also, for Hadoop Streaming API, there is no client side integration required while both Rhipe and RHadoop require client side integration.

The other alternatives to scaling machine learning are Apache Mahout, Apache Hive and some commercial versions of R from Revolution Analytics, Segue framework among others.

More detail on this topic is covered in my latest publication on Machine learning.

Feel free to comment below if you have any question.

HadoopTips & Tricks