This blog post series is on machine learning with R, using the caret package. In this part, we will first perform exploratory data analysis (EDA) on a real-world dataset and then apply non-regularized linear regression to solve a supervised regression problem: predicting power output from a set of environmental readings taken by sensors in a natural gas-fired power generation plant.

In the second part of the series, we will work with regularized linear regression models (ridge, lasso and elastic net). After that, we will explore non-linear regression models. The real-world data we are using in this post consists of 9,568 data points, each with 4 environmental attributes, collected from a Combined Cycle Power Plant over 6 years (2006-2011). It is provided by the University of California, Irvine at the UCI Machine Learning Repository Combined Cycle Power Plant Data Set, where you can find more details about the dataset. The task is a regression problem, since the label (or target) we are trying to predict is numeric.

Libraries

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models.

library(readxl)
library(ggplot2)
library(corrplot)
library(tidyverse)

Load Data

power_plant = read_excel("Folds5x2_pp.xlsx")
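
The line above assumes Folds5x2_pp.xlsx is already in the working directory. If it is not, here is a minimal sketch for fetching it from the UCI repository (assuming the archive URL below is still current):

# Download and unzip the CCPP dataset from UCI, then read the Excel sheet
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00294/CCPP.zip"
download.file(url, destfile = "CCPP.zip", mode = "wb")
unzip("CCPP.zip")
power_plant = read_excel("CCPP/Folds5x2_pp.xlsx")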

Exploratory Data Analysis (EDA)

This is a step we should always perform before trying to fit a model to the data, as it often leads to important insights about the data.

class(power_plant)
'tbl_df' 'tbl' 'data.frame'
# caret has problems working with tbl objects,
# so let's convert the data to a plain data frame
power_plant = data.frame(power_plant)
message("The class is now ", class(power_plant))
# See first few rows
head(power_plant)
AT	V	AP	RH	PE
14.96	41.76	1024.07	73.17	463.26
25.18	62.96	1020.04	59.08	444.37
5.11	39.40	1012.16	92.14	488.56
20.86	57.32	1010.24	76.64	446.48
10.82	37.50	1009.23	96.62	473.90
26.27	59.44	1012.23	58.77	443.67

The columns in the DataFrame are:

- AT: Atmospheric Temperature (°C)
- V: Exhaust Vacuum Speed (cm Hg)
- AP: Atmospheric Pressure (millibar)
- RH: Relative Humidity (%)
- PE: Power Output (MW)

Power Output (PE) is the value we are trying to predict given the measurements above.

# Size of DataFrame
dim(power_plant)
9568 5
map(power_plant, class) 
# all columns are numeric
$AT
'numeric'
$V
'numeric'
$AP
'numeric'
$RH
'numeric'
$PE
'numeric'
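
A quick numerical summary of each column (min, quartiles, mean, max) is also worth a look at this stage:

summary(power_plant)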

Are there any missing values in any of the columns?

map(power_plant, ~sum(is.na(.)))  # using purrr package
# There is no missing data in any of the columns
$AT
0
$V
0
$AP
0
$RH
0
$PE
0
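
For reference, an equivalent check in base R, if you prefer not to depend on purrr:

colSums(is.na(power_plant))  # NA count per column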

Visualize relationships between variables

Before we perform any modeling, it is a good idea to explore correlations between the predictors and the target. This step is important because it helps us select appropriate models. If the features and the outcome are linearly related, we may start with linear regression models. However, if the relationships between the label and the features are non-linear, non-linear ensemble models such as random forests may perform better.
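
As a quick numeric counterpart to the scatter plots below, the correlation of each predictor with the target can be computed directly:

# Pearson correlation of each feature with the target PE
cor(power_plant[, 1:4], power_plant$PE)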

Correlation between atmospheric temperature and power output

power_plant %>% ggplot(aes(AT, PE)) +
    geom_point(color = "blue", alpha = 0.3) +
    ggtitle("Temperature vs Power Output") +
    xlab("Atmospheric Temperature") +
    ylab("Power Output") +
    theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
          axis.text.y = element_text(size = 12),
          axis.text.x = element_text(size = 12, hjust = 0.5),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14))

Gives this plot:

As shown in the figure above, there is a strong negative linear correlation between Atmospheric Temperature and Power Output.
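
Since the same theme() settings repeat across all four scatter plots, one option (purely a refactoring suggestion; my_theme is our own name) is to store them once and reuse them:

# Shared theme settings for all scatter plots
my_theme <- theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
                  axis.text = element_text(size = 12),
                  axis.title = element_text(size = 14))

Each of the plots below could then simply end with + my_theme instead of repeating the whole theme() block.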

Correlation between Exhaust Vacuum Speed and power output

power_plant %>% ggplot(aes(V, PE)) +
    geom_point(color = "darkgreen", alpha = 0.3) +
    ggtitle("Exhaust Vacuum Speed vs Power Output") +
    xlab("Exhaust Vacuum Speed") +
    ylab("Power Output") +
    theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
          axis.text.y = element_text(size = 12),
          axis.text.x = element_text(size = 12, hjust = 0.5),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14))

Gives this plot:

Correlation between atmospheric pressure and power output

power_plant %>% ggplot(aes(AP, PE)) +
    geom_point(color = "red", alpha = 0.3) +
    ggtitle("Atmospheric Pressure vs Power Output") +
    xlab("Atmospheric Pressure") +
    ylab("Power Output") +
    theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
          axis.text.y = element_text(size = 12),
          axis.text.x = element_text(size = 12, hjust = 0.5),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14))

Gives this plot:

Correlation between relative humidity and power output

power_plant %>% ggplot(aes(RH, PE)) +
    geom_point(color = "magenta", alpha = 0.3) +
    ggtitle("Relative Humidity vs Power Output") +
    xlab("Relative Humidity") +
    ylab("Power Output") +
    theme(plot.title = element_text(color = "darkred", size = 18, hjust = 0.5),
          axis.text.y = element_text(size = 12),
          axis.text.x = element_text(size = 12, hjust = 0.5),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14))

Gives this plot:

Correlation heatmap

correlations = cor(power_plant)
corrplot(correlations, method="color")

Gives this plot:

As shown in the correlation heatmap above, the target is correlated with the features. However, we also observe correlation among the features themselves, so we have a multicollinearity problem. We will use regularization to check whether the collinearity we observe has a significant impact on the performance of the linear regression model.
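
A common way to quantify collinearity among predictors is the variance inflation factor (VIF); here is a minimal sketch using the car package (an assumption on our side: car is not used elsewhere in this post and must be installed separately):

library(car)
# Rule of thumb: VIF above roughly 5-10 signals problematic collinearity
vif(lm(PE ~ ., data = power_plant))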

Data Modeling

All the columns are numeric and there are no missing values, which makes our modeling task straightforward. Now, let's model our data to predict the power output given a set of sensor readings. Our first model will be based on simple linear regression, since we saw some linear patterns in the scatter plots and the correlation heatmap during the exploration stage. We need a way of evaluating how well our linear regression model predicts power output as a function of the input parameters. We can do this by splitting the initial dataset into a Training Set, used to train our model, and a Test Set, used to evaluate the model's predictive performance.

library(caret)
# Caret streamlines the process for creating predictive models.
# The package contains tools for:
#  data splitting
#  pre-processing
#  feature selection
#  model tuning using resampling
#  variable importance estimation
#  as well as other functionality.

The caret createDataPartition function creates the indices for the training and test portions of the data for us.

# create training set indices with 80% of data
set.seed(100)  # For reproducibility
# Create index for testing and training data
inTrain <- createDataPartition(y = power_plant$PE, 
                               p = 0.8, list = FALSE)
# subset power_plant data to training
training <- power_plant[inTrain,]
# subset the rest to test
testing <- power_plant[-inTrain,]

# Size ratio of training and test dataset
message("As shown below, the training set is about 80%  and the test set is about 20% of the original data")

rbind("Training set" = nrow(training)/nrow(power_plant),
      "Testing set" = nrow(testing)/nrow(power_plant)) %>% 
       round(2)*100
Training set	80
Testing set	20

As shown above, the training set is about 80% and the test set is about 20% of the original data

Linear Regression

# Fit linear regression model
# put the predictors on the same scale: mean of zero and unit variance
my_lm = train(training[,1:4], training[,5],
               method = "lm",
               preProc = c("center", "scale")
              )
message("Linear Regression: Model performance on \n the training set")
my_lm$results[c("RMSE","Rsquared")] %>%
        round(2)
summary(my_lm)
Linear Regression: Model performance on 
 the training set
RMSE	Rsquared
4.61	0.93
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.281  -3.164  -0.117   3.239  17.704 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 454.39620    0.05233 8683.725  < 2e-16 ***
AT          -14.62996    0.12728 -114.943  < 2e-16 ***
V            -3.03266    0.10347  -29.309  < 2e-16 ***
AP            0.39110    0.06320    6.189 6.38e-10 ***
RH           -2.23520    0.06804  -32.849  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.579 on 7651 degrees of freedom
Multiple R-squared:  0.9281,	Adjusted R-squared:  0.9281 
F-statistic: 2.469e+04 on 4 and 7651 DF,  p-value: < 2.2e-16
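
A note on where the RMSE and Rsquared reported by my_lm$results come from: by default, caret's train estimates them with bootstrap resampling (25 repetitions) on the training set. Here is a minimal sketch of making the resampling scheme explicit instead, swapping in 10-fold cross-validation (ctrl and my_lm_cv are our own names):

# Explicit resampling scheme: 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
my_lm_cv <- train(training[, 1:4], training[, 5],
                  method = "lm",
                  preProc = c("center", "scale"),
                  trControl = ctrl)
my_lm_cv$results[c("RMSE", "Rsquared")]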

Predict using the test data

pred = predict(my_lm, testing[, 1:4])
SSE = sum((testing[, 5] - pred)^2)                 # sum of squared errors
SST = sum((testing[, 5] - mean(training[, 5]))^2)  # total sum of squares; note the baseline is the training-set mean
R_square = 1 - SSE/SST
message('R_squared on the test data:')
round(R_square, 2)
RMSE = sqrt(SSE/length(pred))                      # reuse the SSE computed above
message("Root mean square error on the test data: ")
round(RMSE, 2)
R_squared on the test data:
0.93
Root mean square error on the test data: 
4.48
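
caret also ships a helper, postResample, that computes these test-set metrics in one call. Note that it reports Rsquared as the squared correlation between predictions and observations, so it can differ slightly from the 1 - SSE/SST definition used above:

postResample(pred = pred, obs = testing[, 5])  # returns RMSE, Rsquared, MAE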

Preparing data for plotting

my_data = as.data.frame(cbind(predicted = pred,
                              observed = testing$PE))
# Plot predictions vs test data
ggplot(my_data, aes(predicted, observed)) +
    geom_point(color = "darkred", alpha = 0.5) +
    geom_smooth(method = lm) +
    ggtitle("Linear Regression: Prediction vs Test Data") +
    xlab("Predicted Power Output") +
    ylab("Observed Power Output") +
    theme(plot.title = element_text(color = "darkgreen", size = 18, hjust = 0.5),
          axis.text.y = element_text(size = 12),
          axis.text.x = element_text(size = 12, hjust = 0.5),
          axis.title.x = element_text(size = 14),
          axis.title.y = element_text(size = 14))

Gives this plot:
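
Beyond the predicted-vs-observed plot, a residuals-vs-predicted plot is a quick diagnostic for structure the linear model may have missed; here is a minimal sketch:

# Residuals vs predicted values; points should scatter evenly around zero
ggplot(data.frame(predicted = pred,
                  residual = testing$PE - pred),
       aes(predicted, residual)) +
    geom_point(color = "darkred", alpha = 0.3) +
    geom_hline(yintercept = 0) +
    ggtitle("Linear Regression: Residuals vs Predicted") +
    xlab("Predicted Power Output") +
    ylab("Residual")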

This blog post series is on machine learning with R. In this first post, we performed exploratory data analysis and fit a non-regularized multivariate linear regression model. In the second part of the series, we will work with regularized linear regression models, and after that we will explore non-linear regression models. See you in the next post.