Extreme Gradient Boosting (XGBoost) is among the most exciting R and Python machine learning libraries these days. Previously, I wrote a tutorial on how to use Extreme Gradient Boosting with R. In this post, I will show how to conduct a similar analysis in Python. Extreme Gradient Boosting supports various objective functions, including regression, classification, and ranking. It has gained much popularity and attention recently because it has been the algorithm of choice for many winning teams in machine learning competitions.

This post is a continuation of my previous Machine learning with R blog post series. The first one is available here.

Import Python libraries

import xgboost as xgb
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

Read the data into a Pandas dataframe

power_plant = pd.read_excel("Folds5x2_pp.xlsx")
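
A quick, optional look at the data before modelling (the Folds5x2_pp.xlsx file is the Combined Cycle Power Plant dataset: four ambient variables plus the power output column PE):

print(power_plant.shape)
print(power_plant.head())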

Create training and test datasets

X = power_plant.drop("PE", axis = 1)
y = power_plant['PE'].values
y = y.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Convert the training and testing sets into DMatrices

DMatrix is the recommended data structure in xgboost: it is the input format for xgboost's native API and is optimized for memory efficiency and training speed.

DM_train = xgb.DMatrix(data = X_train, 
                       label = y_train)  
DM_test =  xgb.DMatrix(data = X_test,
                       label = y_test)
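
The scikit-learn style tuning below works directly on the pandas/numpy arrays, so the DMatrix objects are only needed when using xgboost's native API. As a minimal sketch (illustrative parameter values only, not the tuned model used later), DM_train could be passed to xgboost's built-in cross-validation:

params = {"objective": "reg:squarederror",   # called "reg:linear" in older xgboost versions
          "max_depth": 5}                    # illustrative value
cv_results = xgb.cv(params = params, dtrain = DM_train,
                    num_boost_round = 100, nfold = 5,
                    metrics = "rmse", seed = 42)
print(cv_results["test-rmse-mean"].tail(1))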

There are different hyperparameters that we can tune, and they differ from base learner to base learner. For tree-based learners, which are the most common in xgboost applications, the following are among the most commonly tuned hyperparameters:

learning rate: shrinks the contribution of each boosting round; smaller values typically require more trees.
max_depth: maximum depth of each tree; deeper trees capture more complex interactions but can overfit.
colsample_bytree: fraction of features (columns) sampled when building each tree.
n_estimators: number of boosting rounds, i.e. the number of trees that are fit.

In both R and Python, the default base learners are trees (gbtree), but we can also specify gblinear for linear base learners or dart, a tree booster that applies dropout regularization.
In this post, I will optimize only three of the parameters shown above; you can try optimizing the others as well. You can see the full list of parameters and their details on the xgboost website.
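
As a quick illustration of the base learner choice described above (a sketch, not part of the tuned model below):

tree_model   = xgb.XGBRegressor(booster = "gbtree")    # default: tree base learners
linear_model = xgb.XGBRegressor(booster = "gblinear")  # linear base learners
dart_model   = xgb.XGBRegressor(booster = "dart")      # trees with dropout regularization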
Parameters for grid search

gbm_param_grid = {
     'colsample_bytree': np.linspace(0.5, 0.9, 5),
     'n_estimators':[100, 200],
     'max_depth': [10, 15, 20, 25]
}

Instantiate the regressor

gbm = xgb.XGBRegressor()

Perform grid search

Let's perform 5-fold cross-validation using the negative mean squared error as the scoring metric (scikit-learn maximizes scores, so the MSE is negated).

grid_mse = GridSearchCV(estimator = gbm,
                        param_grid = gbm_param_grid,
                        scoring = 'neg_mean_squared_error',
                        cv = 5,
                        verbose = 1)
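
Grid search evaluates every combination in the grid, which can become slow for larger grids. A common alternative, sketched here with the same estimator and parameter space but not used for the results below, is to sample a fixed number of combinations with RandomizedSearchCV:

from sklearn.model_selection import RandomizedSearchCV

# Sketch only: evaluate 10 randomly sampled combinations instead of all 40
randomized_mse = RandomizedSearchCV(estimator = gbm,
                                    param_distributions = gbm_param_grid,
                                    n_iter = 10,
                                    scoring = 'neg_mean_squared_error',
                                    cv = 5,
                                    random_state = 42,
                                    verbose = 1)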

Fit grid_mse to the data, get best parameters and best score (lowest RMSE)

grid_mse.fit(X_train, y_train)
print("Best parameters found: ",grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
Fitting 5 folds for each of 40 candidates, totalling 200 fits
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed: 11.0min finished
Best parameters found:  {'colsample_bytree': 0.80000000000000004, 'max_depth': 15, 'n_estimators': 200}
Lowest RMSE found:  3.03977094354

Predict using the test data

pred = grid_mse.predict(X_test)
print("Root mean square error for test dataset: {}".format(np.round(np.sqrt(mean_squared_error(y_test, pred)), 2)))
Root mean square error for test dataset: 2.76
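
GridSearchCV refits the best parameter combination on the full training set by default, so grid_mse.predict above already uses the tuned model. If you want to inspect that model directly, here is a small optional sketch using xgboost's plotting helper:

best_model = grid_mse.best_estimator_   # the refit XGBRegressor
xgb.plot_importance(best_model)         # plot feature importance scores
plt.show()

Now let's plot the predictions against the observed values of the test set.
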
test = pd.DataFrame({"prediction": pred, "observed": y_test.flatten()})
lowess = sm.nonparametric.lowess
# lowess(endog, exog): smooth the observed values as a function of the predictions,
# so the fitted curve lines up with the prediction-vs-observed scatter plot below
z = lowess(y_test.flatten(), pred)
test.plot(figsize = [14,8],
          x ="prediction", y = "observed", kind = "scatter", color = 'darkred')
plt.title("Extreme Gradient Boosting: Prediction Vs Test Data", fontsize = 18, color = "darkgreen")
plt.xlabel("Predicted Power Output", fontsize = 18) 
plt.ylabel("Observed Power Output", fontsize = 18)
plt.plot(z[:,0], z[:,1], color = "blue", lw= 3)
plt.show()

The plot:

Summary

In this post, I used Python to run Extreme Gradient Boosting to predict power output. We see that it performs better than the linear model we tried in the first part of this blog post series.