An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.

Regression Models

In the first part, we introduced the weather dataset and outlined its exploratory analysis. In the second part of our tutorial, we are going to build multiple logistic regression models to predict weather forecast. Specifically, we intend to produce the following forecasts:

- tomorrow’s weather forecast at 9am of the current day
- tomorrow’s weather forecast at 3pm of the current day
- tomorrow’s weather forecast at late evening time of the current day

For each of above tasks, a specific subset of variables shall be available, and precisely:

- 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
- 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
- evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed

We suppose the MinTemp already available at 9am as we expect the overnight temperature resulting with that information. We suppose the MaxTemp already available at 3pm, as determined on central day hours. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed final information only by late evening. Other variables are explicitely bound to a specific daytime.

suppressPackageStartupMessages(library(caret)) set.seed(1023) weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE) colnames(weather_data5)[1] "MinTemp" "MaxTemp" "Evaporation" "Sunshine" "WindGustDir" [6] "WindGustSpeed" "WindSpeed9am" "WindSpeed3pm" "Humidity9am" "Humidity3pm" [11] "Pressure9am" "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am" [16] "Temp3pm" "RainTomorrow"

nrow(weather_data5)[1] 353

sum(weather_data5["RainTomorrow"] == "Yes")[1] 64

sum(weather_data5["RainTomorrow"] == "No")[1] 289

We are going to split the original dataset in a training dataset (70% of original data) and a testing dataset (30% remaining).

train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE) training <- weather_data5[train_rec,] testing <- weather_data5[-train_rec,]

We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets.

sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No")[1] 0.2216749

sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No")[1] 0.2209302

The dataset is slightly unbalanced with respect the label to be predicted. We will check later if some remedy is needed or achieved accuracy is satisfactory as well.

For all models, we are going to take advantage of a train control directive made available by the caret package which prescribes repeated k-fold cross-validation. The k-fold cross validation method involves splitting the training dataset into k-subsets. For each subset, one is held out while the model is trained on all other subsets. This process is completed until accuracy is determined for each instance in the training dataset, and an overall accuracy estimate is provided. The process of splitting the data into k-folds can be repeated a number of times and this is called repeated k-fold cross validation. The final model accuracy is taken as the mean from the number of repeats.

trControl <- trainControl(method = "repeatedcv", repeats = 5, number = 10, verboseIter = FALSE)

The trainControl is passed as a parameter to the train() caret function. As a further input to the train() function, the tuneLength parameter indicates the number of different values to try for each algorithm parameter. However in case of the “glm” method, it does not drive any specific behavior and hence it will be omitted.

By taking into account the results from exploratory analysis, we herein compare two models for 9AM prediction. The first one is so determined and evaluated.

predictors_9am_c1 <- c("Cloud9am", "Humidity9am", "Pressure9am", "Temp9am") formula_9am_c1 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c6, collapse="+"), sep="~")) mod9am_c1_fit <- train(formula_9am_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod9am_c1_fit$results$Accuracy[1] 0.8383538

The resulting accuracy shown above is the one determined by the repeated k-fold cross validation as above explained. The obtained value is not that bad considering the use of just 9AM available data.

(summary_rep <- summary(mod9am_c1_fit$finalModel))Call: NULL Deviance Residuals: Min 1Q Median 3Q Max -1.7667 -0.5437 -0.3384 -0.1874 2.7742 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 96.01611 33.57702 2.860 0.004242 ** Cloud9am 0.17180 0.07763 2.213 0.026893 * Humidity9am 0.06769 0.02043 3.313 0.000922 *** Pressure9am -0.10286 0.03283 -3.133 0.001729 ** Temp9am 0.10595 0.04545 2.331 0.019744 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 177.34 on 243 degrees of freedom AIC: 187.34 Number of Fisher Scoring iterations: 5

From above summary, all predictors result as significative for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9994587

We further investigate if any predictor can be dropped by taking advantage of the drop1() function.

drop1(mod9am_c1_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am Df Deviance AIC LRT Pr(>Chi) 177.34 187.34 Cloud9am 1 182.48 190.48 5.1414 0.0233623 * Humidity9am 1 190.19 198.19 12.8502 0.0003374 *** Pressure9am 1 187.60 195.60 10.2668 0.0013544 ** Temp9am 1 183.13 191.13 5.7914 0.0161050 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can evaluate a second model where the MinTemp is replaced by the Temp9am. We do not keep both as they are correlated (see part #1 exploratory analysis).

predictors_9am_c2 <- c("Cloud9am", "Humidity9am", "Pressure9am", "MinTemp") formula_9am_c2 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c2, collapse="+"), sep="~")) mod9am_c2_fit <- train(formula_9am_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod9am_c2_fit$results$Accuracy[1] 0.8366462

(summary_rep <- summary(mod9am_c2_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -1.7233 -0.5446 -0.3510 -0.1994 2.7237 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 103.00407 33.57102 3.068 0.00215 ** Cloud9am 0.16379 0.08014 2.044 0.04096 * Humidity9am 0.05462 0.01855 2.945 0.00323 ** Pressure9am -0.10792 0.03298 -3.272 0.00107 ** MinTemp 0.06750 0.04091 1.650 0.09896 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.9 on 247 degrees of freedom Residual deviance: 180.3 on 243 degrees of freedom AIC: 190.3 Number of Fisher Scoring iterations: 5

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9990409

This second model shows similar accuracy values, however MinTemp p-value shows that such predictor is not significative. Further, the explained variance is slightly less than the first model one. As a conclusion, we select the first model for 9AM predictions purpose and we go on by calculating accuracy with the test set.

mod9am_pred <- predict(mod9am_c1_fit, testing) confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])Confusion Matrix and Statistics Reference Prediction No Yes No 80 12 Yes 6 7 Accuracy : 0.8286 95% CI : (0.7427, 0.8951) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.4602 Kappa : 0.3405 Mcnemar's Test P-Value : 0.2386 Sensitivity : 0.9302 Specificity : 0.3684 Pos Pred Value : 0.8696 Neg Pred Value : 0.5385 Prevalence : 0.8190 Detection Rate : 0.7619 Detection Prevalence : 0.8762 Balanced Accuracy : 0.6493 'Positive' Class : No

The confusion matrix shows encouraging results, not a bad test accuracy for a 9AM weather forecast. We then go on building the 3PM prediction model.

We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated each other and they are not linearly dependent.

predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm") formula_3pm_c1 <- as.formula(paste("RainTomorrow", paste(predictors_3pm_c1, collapse="+"), sep="~")) mod3pm_c1_fit <- train(formula_3pm_c1, data = training, method = "glm", family = "binomial", trControl = trControl, metric = 'Accuracy') mod3pm_c1$_fitresults$Accuracy[1] 0.8582333

This model shows an acceptable accuracy as measured on the training set.

(summary_rep <- summary(mod3pm_c1_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -2.3065 -0.5114 -0.2792 -0.1340 2.6641 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 122.56229 35.20777 3.481 0.000499 *** Cloud3pm 0.27754 0.09866 2.813 0.004906 ** Humidity3pm 0.06745 0.01817 3.711 0.000206 *** Pressure3pm -0.12885 0.03443 -3.743 0.000182 *** Temp3pm 0.10877 0.04560 2.385 0.017071 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 159.64 on 243 degrees of freedom AIC: 169.64 Number of Fisher Scoring iterations: 6

All predictors are reported as significative for the model. Let us further verify if any predictor can be dropped.

drop1(mod3pm_c1_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm Df Deviance AIC LRT Pr(>Chi) 159.64 169.64 Cloud3pm 1 168.23 176.23 8.5915 0.003377 ** Humidity3pm 1 175.91 183.91 16.2675 5.500e-05 *** Pressure3pm 1 175.60 183.60 15.9551 6.486e-05 *** Temp3pm 1 165.77 173.77 6.1329 0.013269 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9999913

We go on with the computation of the test set accuracy.

mod3pm_pred <- predict(mod3pm_c1_fit, testing) confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])Confusion Matrix and Statistics Reference Prediction No Yes No 82 8 Yes 4 11 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.58 Mcnemar's Test P-Value : 0.38648 Sensitivity : 0.9535 Specificity : 0.5789 Pos Pred Value : 0.9111 Neg Pred Value : 0.7333 Prevalence : 0.8190 Detection Rate : 0.7810 Detection Prevalence : 0.8571 Balanced Accuracy : 0.7662 'Positive' Class : No

The test set prediction accuracy is quite satisfactory.

As first evening forecast model, we introduce the Sunshine variable and at the same time we take Cloud3pm and Humidity3pm out as strongly correlated to Sunshine.

predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine") formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~")) mod_ev_c1_fit <- train(formula_evening_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod_ev_c1_fit$results$Accuracy[1] 0.8466974

The training set based accuracy is acceptable.

(summary_rep <- summary(mod_ev_c1_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -1.9404 -0.4532 -0.2876 -0.1350 2.4664 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 174.43458 37.12201 4.699 2.61e-06 *** Pressure3pm -0.17175 0.03635 -4.726 2.29e-06 *** Temp3pm 0.06506 0.03763 1.729 0.0838 . Sunshine -0.41519 0.07036 -5.901 3.61e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 156.93 on 244 degrees of freedom AIC: 164.93 Number of Fisher Scoring iterations: 6

The Temp3pm predictor is reported as not significative and can be dropped as confirmed below.

drop1(mod_ev_c1_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Pressure3pm + Temp3pm + Sunshine Df Deviance AIC LRT Pr(>Chi) 156.93 164.93 Pressure3pm 1 185.64 191.64 28.712 8.400e-08 *** Temp3pm 1 160.03 166.03 3.105 0.07805 . Sunshine 1 203.03 209.03 46.098 1.125e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9999968

As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed") formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~")) mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod_ev_c2_fit$results$Accuracy[1] 0.8681179

It results with an improved training set accuracy.

(summary_rep <- summary(mod_ev_c2_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -2.44962 -0.42289 -0.20929 -0.04328 2.76204 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 9.928e+01 4.863e+01 2.042 0.041193 * Cloud3pm 2.569e-01 1.081e-01 2.376 0.017481 * Humidity3pm 8.441e-02 2.208e-02 3.822 0.000132 *** Pressure3pm -1.088e-01 4.695e-02 -2.318 0.020462 * Temp3pm 1.382e-01 5.540e-02 2.495 0.012596 * WindGustDirENE -1.064e+00 1.418e+00 -0.750 0.453254 WindGustDirESE 1.998e+00 1.088e+00 1.837 0.066235 . WindGustDirN -2.731e-03 1.759e+00 -0.002 0.998761 WindGustDirNE 1.773e+00 1.260e+00 1.407 0.159364 WindGustDirNNE -1.619e+01 2.884e+03 -0.006 0.995520 WindGustDirNNW 1.859e+00 1.021e+00 1.821 0.068672 . WindGustDirNW 1.011e+00 1.009e+00 1.002 0.316338 WindGustDirS 1.752e+00 1.248e+00 1.404 0.160394 WindGustDirSE -1.549e+01 2.075e+03 -0.007 0.994043 WindGustDirSSE -1.051e-01 1.636e+00 -0.064 0.948777 WindGustDirSSW -1.451e+01 6.523e+03 -0.002 0.998225 WindGustDirSW 2.002e+01 6.523e+03 0.003 0.997551 WindGustDirW 1.108e+00 1.183e+00 0.936 0.349051 WindGustDirWNW 1.325e-01 1.269e+00 0.104 0.916848 WindGustDirWSW -1.574e+01 4.367e+03 -0.004 0.997124 WindGustSpeed 1.674e-02 2.065e-02 0.811 0.417420 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 135.61 on 227 degrees of freedom AIC: 177.61 Number of Fisher Scoring iterations: 17

The WindGustDir and WindGustSpeed predictors are reported as not significative. Also the AIC value is sensibly increased.

drop1(mod_ev_c2_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW + WindGustSpeed Df Deviance AIC LRT Pr(>Chi) 135.61 177.61 Cloud3pm 1 141.64 181.64 6.0340 0.01403 * Humidity3pm 1 154.49 194.49 18.8817 1.391e-05 *** Pressure3pm 1 141.38 181.38 5.7692 0.01631 * Temp3pm 1 142.21 182.21 6.5992 0.01020 * WindGustDirENE 1 136.20 176.20 0.5921 0.44159 WindGustDirESE 1 139.40 179.40 3.7883 0.05161 . WindGustDirN 1 135.61 175.61 0.0000 0.99876 WindGustDirNE 1 137.55 177.55 1.9412 0.16354 WindGustDirNNE 1 136.40 176.40 0.7859 0.37535 WindGustDirNNW 1 139.43 179.43 3.8160 0.05077 . WindGustDirNW 1 136.69 176.69 1.0855 0.29747 WindGustDirS 1 137.64 177.64 2.0293 0.15430 WindGustDirSE 1 136.36 176.36 0.7458 0.38782 WindGustDirSSE 1 135.61 175.61 0.0041 0.94866 WindGustDirSSW 1 135.64 175.64 0.0339 0.85384 WindGustDirSW 1 138.38 178.38 2.7693 0.09609 . WindGustDirW 1 136.52 176.52 0.9106 0.33996 WindGustDirWNW 1 135.62 175.62 0.0109 0.91689 WindGustDirWSW 1 135.85 175.85 0.2414 0.62318 WindGustSpeed 1 136.27 176.27 0.6633 0.41541 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDir has some borderline p-value for some specific directions. WindGustSpeed is not significative and we should drop it from the model. Hence, we redefine such model after having taken WindGustSpeed off while keeping WindGustDir.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir") formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~")) mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod_ev_c2_fit$results$Accuracy[1] 0.8574256

(summary_rep <- summary(mod_ev_c2_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -2.45082 -0.40624 -0.21271 -0.04667 2.71193 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 118.76549 42.77047 2.777 0.005490 ** Cloud3pm 0.25566 0.10794 2.369 0.017859 * Humidity3pm 0.08052 0.02126 3.787 0.000152 *** Pressure3pm -0.12701 0.04172 -3.044 0.002333 ** Temp3pm 0.13035 0.05441 2.396 0.016592 * WindGustDirENE -1.13954 1.43581 -0.794 0.427398 WindGustDirESE 2.03313 1.10027 1.848 0.064624 . WindGustDirN -0.00212 1.68548 -0.001 0.998996 WindGustDirNE 1.59121 1.25530 1.268 0.204943 WindGustDirNNE -16.31932 2891.85796 -0.006 0.995497 WindGustDirNNW 1.87617 1.03532 1.812 0.069962 . WindGustDirNW 1.09105 1.01928 1.070 0.284433 WindGustDirS 1.93346 1.24036 1.559 0.119046 WindGustDirSE -15.50996 2073.57766 -0.007 0.994032 WindGustDirSSE -0.09029 1.63401 -0.055 0.955934 WindGustDirSSW -14.34617 6522.63868 -0.002 0.998245 WindGustDirSW 19.99670 6522.63868 0.003 0.997554 WindGustDirW 1.16834 1.18736 0.984 0.325127 WindGustDirWNW 0.16961 1.28341 0.132 0.894858 WindGustDirWSW -15.71816 4349.40572 -0.004 0.997117 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 136.27 on 228 degrees of freedom AIC: 176.27 Number of Fisher Scoring iterations: 17

drop1(mod_ev_c2_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW Df Deviance AIC LRT Pr(>Chi) 136.27 176.27 Cloud3pm 1 142.24 180.24 5.9653 0.014590 * Humidity3pm 1 154.50 192.50 18.2255 1.962e-05 *** Pressure3pm 1 146.76 184.76 10.4852 0.001203 ** Temp3pm 1 142.33 180.33 6.0611 0.013819 * WindGustDirENE 1 136.93 174.93 0.6612 0.416126 WindGustDirESE 1 140.13 178.13 3.8568 0.049545 * WindGustDirN 1 136.27 174.27 0.0000 0.998996 WindGustDirNE 1 137.87 175.87 1.5970 0.206332 WindGustDirNNE 1 137.14 175.14 0.8698 0.351012 WindGustDirNNW 1 140.09 178.09 3.8159 0.050769 . WindGustDirNW 1 137.52 175.52 1.2477 0.263994 WindGustDirS 1 138.78 176.78 2.5032 0.113617 WindGustDirSE 1 137.03 175.03 0.7541 0.385179 WindGustDirSSE 1 136.28 174.28 0.0031 0.955862 WindGustDirSSW 1 136.30 174.30 0.0290 0.864850 WindGustDirSW 1 139.00 177.00 2.7314 0.098393 . WindGustDirW 1 137.28 175.28 1.0100 0.314905 WindGustDirWNW 1 136.29 174.29 0.0174 0.894921 WindGustDirWSW 1 136.51 174.51 0.2373 0.626164 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDirESE is reported as significant. Hence to include WindGustDir is right and then we accept that model as a candidate one. The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9999998

To investigate a final third choice, we gather a small set of predictors, Pressure3pm and Sunshine.

predictors_evening_c3 <- c("Pressure3pm", "Sunshine") formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~")) mod_ev_c3_fit <- train(formula_evening_c3, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod_ev_c3_fit$results$Accuracy[1] 0.8484513

The training set based accuracy has an acceptable value.

(summary_rep <- summary(mod_ev_c3_fit$finalModel))Deviance Residuals: Min 1Q Median 3Q Max -2.1123 -0.4602 -0.2961 -0.1562 2.3997 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 192.97146 36.09640 5.346 8.99e-08 *** Pressure3pm -0.18913 0.03545 -5.336 9.52e-08 *** Sunshine -0.35973 0.06025 -5.971 2.36e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 160.03 on 245 degrees of freedom AIC: 166.03 Number of Fisher Scoring iterations: 6

All predictors are reported as significative.

drop1(mod_ev_c3_fit$finalModel, test="Chisq")Single term deletions Model: .outcome ~ Pressure3pm + Sunshine Df Deviance AIC LRT Pr(>Chi) 160.03 166.03 Pressure3pm 1 199.36 203.36 39.324 3.591e-10 *** Sunshine 1 206.09 210.09 46.060 1.147e-11 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)[1] 0.9999938

We compare the last two models by running an ANOVA analysis on those to check if the lower residual deviance of the first model is significative or not.

anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")Analysis of Deviance Table Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW Model 2: .outcome ~ Pressure3pm + Sunshine Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 228 136.27 2 245 160.03 -17 -23.76 0.1261

Based on p-value, there is no significative difference between them. We then choose both models. The first model because it provides with a better accuracy than the second. The second model for its simplicity. Let us evaluate the test accuracy for both of them.

modevening_pred <- predict(mod_ev_c2_fit, testing) confusionMatrix(modevening_pred, testing[,"RainTomorrow"])Confusion Matrix and Statistics Reference Prediction No Yes No 83 9 Yes 3 10 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.5604 Mcnemar's Test P-Value : 0.14891 Sensitivity : 0.9651 Specificity : 0.5263 Pos Pred Value : 0.9022 Neg Pred Value : 0.7692 Prevalence : 0.8190 Detection Rate : 0.7905 Detection Prevalence : 0.8762 Balanced Accuracy : 0.7457 'Positive' Class : No

Good test accuracy with a high sensitivity and positive predicted values. We then test the second evening forecast candidate model.

modevening_pred <- predict(mod_ev_c3_fit, testing) confusionMatrix(modevening_pred, testing[,"RainTomorrow"])Confusion Matrix and Statistics Reference Prediction No Yes No 81 7 Yes 5 12 Accuracy : 0.8857 95% CI : (0.8089, 0.9395) No Information Rate : 0.819 P-Value [Acc > NIR] : 0.04408 Kappa : 0.598 Mcnemar's Test P-Value : 0.77283 Sensitivity : 0.9419 Specificity : 0.6316 Pos Pred Value : 0.9205 Neg Pred Value : 0.7059 Prevalence : 0.8190 Detection Rate : 0.7714 Detection Prevalence : 0.8381 Balanced Accuracy : 0.7867 'Positive' Class : No

Slightly higher accuracy and lower sensitivity than the previous model. Positive predicitive value is improved with respect the same.

Resuming up our final choices:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

We save the training record indexes, the training and testing datasets together with all the chosen models for further analysis.

saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit, mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit), file="wf_log_reg_part2.rds")

We build models for weather forecasts purposes at different hours of the day. We explored predictors significancy and training accuracy to determine which are the best models. We computed test accuracy and collect results. Prediction accuracy was shown to be quite satisfactory for 3PM and evening models. In next part of this tutorial, we will tune all those models to improve their accuracy if possible.

If you have any questions, please feel free to comment below.