An online community for showcasing R & Python tutorials. It operates as a networking platform for data scientists to promote their talent and get hired. Our mission is to empower data scientists by bridging the gap between talent and opportunity.
Regression Models

Weather forecast with regression models – part 2

In the first part, we introduced the weather dataset and outlined its exploratory analysis. In the second part of our tutorial, we are going to build multiple logistic regression models to predict weather forecast. Specifically, we intend to produce the following forecasts:

• tomorrow’s weather forecast at 9am of the current day
• tomorrow’s weather forecast at 3pm of the current day
• tomorrow’s weather forecast at late evening time of the current day

For each of above tasks, a specific subset of variables shall be available, and precisely:

• 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
• 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
• evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed

We suppose the MinTemp already available at 9am as we expect the overnight temperature resulting with that information. We suppose the MaxTemp already available at 3pm, as determined on central day hours. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed final information only by late evening. Other variables are explicitely bound to a specific daytime.

suppressPackageStartupMessages(library(caret))
set.seed(1023)
weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(weather_data5)
[1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"
[6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[16] "Temp3pm"       "RainTomorrow"

nrow(weather_data5)
[1] 353

sum(weather_data5["RainTomorrow"] == "Yes")
[1] 64
sum(weather_data5["RainTomorrow"] == "No")
[1] 289


We are going to split the original dataset in a training dataset (70% of original data) and a testing dataset (30% remaining).

train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE) training <- weather_data5[train_rec,] testing <- weather_data5[-train_rec,]  We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets. sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No") [1] 0.2216749  sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No") [1] 0.2209302  The dataset is slightly unbalanced with respect the label to be predicted. We will check later if some remedy is needed or achieved accuracy is satisfactory as well. 9AM Forecast Model For all models, we are going to take advantage of a train control directive made available by the caret package which prescribes repeated k-fold cross-validation. The k-fold cross validation method involves splitting the training dataset into k-subsets. For each subset, one is held out while the model is trained on all other subsets. This process is completed until accuracy is determined for each instance in the training dataset, and an overall accuracy estimate is provided. The process of splitting the data into k-folds can be repeated a number of times and this is called repeated k-fold cross validation. The final model accuracy is taken as the mean from the number of repeats. trControl <- trainControl(method = "repeatedcv", repeats = 5, number = 10, verboseIter = FALSE)  The trainControl is passed as a parameter to the train() caret function. As a further input to the train() function, the tuneLength parameter indicates the number of different values to try for each algorithm parameter. However in case of the “glm” method, it does not drive any specific behavior and hence it will be omitted. By taking into account the results from exploratory analysis, we herein compare two models for 9AM prediction. The first one is so determined and evaluated. predictors_9am_c1 <- c("Cloud9am", "Humidity9am", "Pressure9am", "Temp9am") formula_9am_c1 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c6, collapse="+"), sep="~")) mod9am_c1_fit <- train(formula_9am_c1, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod9am_c1_fit$results$Accuracy [1] 0.8383538  The resulting accuracy shown above is the one determined by the repeated k-fold cross validation as above explained. The obtained value is not that bad considering the use of just 9AM available data. (summary_rep <- summary(mod9am_c1_fit$finalModel))
Call:
NULL

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.7667  -0.5437  -0.3384  -0.1874   2.7742

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 96.01611   33.57702   2.860 0.004242 **
Cloud9am     0.17180    0.07763   2.213 0.026893 *
Humidity9am  0.06769    0.02043   3.313 0.000922 ***
Pressure9am -0.10286    0.03283  -3.133 0.001729 **
Temp9am      0.10595    0.04545   2.331 0.019744 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 177.34  on 243  degrees of freedom
AIC: 187.34

Number of Fisher Scoring iterations: 5



From above summary, all predictors result as significative for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9994587



We further investigate if any predictor can be dropped by taking advantage of the drop1() function.

drop1(mod9am_c1_fit$finalModel, test="Chisq") Single term deletions Model: .outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am Df Deviance AIC LRT Pr(>Chi) 177.34 187.34 Cloud9am 1 182.48 190.48 5.1414 0.0233623 * Humidity9am 1 190.19 198.19 12.8502 0.0003374 *** Pressure9am 1 187.60 195.60 10.2668 0.0013544 ** Temp9am 1 183.13 191.13 5.7914 0.0161050 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  We can evaluate a second model where the MinTemp is replaced by the Temp9am. We do not keep both as they are correlated (see part #1 exploratory analysis). predictors_9am_c2 <- c("Cloud9am", "Humidity9am", "Pressure9am", "MinTemp") formula_9am_c2 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c2, collapse="+"), sep="~")) mod9am_c2_fit <- train(formula_9am_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod9am_c2_fit$results$Accuracy [1] 0.8366462  (summary_rep <- summary(mod9am_c2_fit$finalModel))
Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.7233  -0.5446  -0.3510  -0.1994   2.7237

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 103.00407   33.57102   3.068  0.00215 **
Cloud9am      0.16379    0.08014   2.044  0.04096 *
Humidity9am   0.05462    0.01855   2.945  0.00323 **
Pressure9am  -0.10792    0.03298  -3.272  0.00107 **
MinTemp       0.06750    0.04091   1.650  0.09896 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.9  on 247  degrees of freedom
Residual deviance: 180.3  on 243  degrees of freedom
AIC: 190.3

Number of Fisher Scoring iterations: 5



The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9990409



This second model shows similar accuracy values, however MinTemp p-value shows that such predictor is not significative. Further, the explained variance is slightly less than the first model one. As a conclusion, we select the first model for 9AM predictions purpose and we go on by calculating accuracy with the test set.

mod9am_pred <- predict(mod9am_c1_fit, testing)
confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

Reference
Prediction No Yes
No  80  12
Yes  6   7

Accuracy : 0.8286
95% CI : (0.7427, 0.8951)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.4602

Kappa : 0.3405
Mcnemar's Test P-Value : 0.2386

Sensitivity : 0.9302
Specificity : 0.3684
Pos Pred Value : 0.8696
Neg Pred Value : 0.5385
Prevalence : 0.8190
Detection Rate : 0.7619
Detection Prevalence : 0.8762
Balanced Accuracy : 0.6493

'Positive' Class : No



The confusion matrix shows encouraging results, not a bad test accuracy for a 9AM weather forecast. We then go on building the 3PM prediction model.

3PM Forecast Model

We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated each other and they are not linearly dependent.

predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm")
formula_3pm_c1 <- as.formula(paste("RainTomorrow", paste(predictors_3pm_c1, collapse="+"), sep="~"))
mod3pm_c1_fit <- train(formula_3pm_c1,  data = training, method = "glm", family = "binomial",
trControl = trControl, metric = 'Accuracy')
mod3pm_c1$_fitresults$Accuracy
[1] 0.8582333



This model shows an acceptable accuracy as measured on the training set.

(summary_rep <- summary(mod3pm_c1_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.3065 -0.5114 -0.2792 -0.1340 2.6641 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 122.56229 35.20777 3.481 0.000499 *** Cloud3pm 0.27754 0.09866 2.813 0.004906 ** Humidity3pm 0.06745 0.01817 3.711 0.000206 *** Pressure3pm -0.12885 0.03443 -3.743 0.000182 *** Temp3pm 0.10877 0.04560 2.385 0.017071 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 159.64 on 243 degrees of freedom AIC: 169.64 Number of Fisher Scoring iterations: 6  All predictors are reported as significative for the model. Let us further verify if any predictor can be dropped. drop1(mod3pm_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
Df Deviance    AIC     LRT  Pr(>Chi)
159.64 169.64
Cloud3pm     1   168.23 176.23  8.5915  0.003377 **
Humidity3pm  1   175.91 183.91 16.2675 5.500e-05 ***
Pressure3pm  1   175.60 183.60 15.9551 6.486e-05 ***
Temp3pm      1   165.77 173.77  6.1329  0.013269 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999913



We go on with the computation of the test set accuracy.

mod3pm_pred <- predict(mod3pm_c1_fit, testing)
confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

Reference
Prediction No Yes
No  82   8
Yes  4  11

Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408

Kappa : 0.58
Mcnemar's Test P-Value : 0.38648

Sensitivity : 0.9535
Specificity : 0.5789
Pos Pred Value : 0.9111
Neg Pred Value : 0.7333
Prevalence : 0.8190
Detection Rate : 0.7810
Detection Prevalence : 0.8571
Balanced Accuracy : 0.7662

'Positive' Class : No



The test set prediction accuracy is quite satisfactory.

Evening Forecast Model

As first evening forecast model, we introduce the Sunshine variable and at the same time we take Cloud3pm and Humidity3pm out as strongly correlated to Sunshine.

predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine")
formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~"))
mod_ev_c1_fit <- train(formula_evening_c1,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c1_fit$results$Accuracy
[1] 0.8466974



The training set based accuracy is acceptable.

(summary_rep <- summary(mod_ev_c1_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -1.9404 -0.4532 -0.2876 -0.1350 2.4664 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 174.43458 37.12201 4.699 2.61e-06 *** Pressure3pm -0.17175 0.03635 -4.726 2.29e-06 *** Temp3pm 0.06506 0.03763 1.729 0.0838 . Sunshine -0.41519 0.07036 -5.901 3.61e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 156.93 on 244 degrees of freedom AIC: 164.93 Number of Fisher Scoring iterations: 6  The Temp3pm predictor is reported as not significative and can be dropped as confirmed below. drop1(mod_ev_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Temp3pm + Sunshine
Df Deviance    AIC    LRT  Pr(>Chi)
156.93 164.93
Pressure3pm  1   185.64 191.64 28.712 8.400e-08 ***
Temp3pm      1   160.03 166.03  3.105   0.07805 .
Sunshine     1   203.03 209.03 46.098 1.125e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999968



As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8681179



It results with an improved training set accuracy.

(summary_rep <- summary(mod_ev_c2_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.44962 -0.42289 -0.20929 -0.04328 2.76204 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 9.928e+01 4.863e+01 2.042 0.041193 * Cloud3pm 2.569e-01 1.081e-01 2.376 0.017481 * Humidity3pm 8.441e-02 2.208e-02 3.822 0.000132 *** Pressure3pm -1.088e-01 4.695e-02 -2.318 0.020462 * Temp3pm 1.382e-01 5.540e-02 2.495 0.012596 * WindGustDirENE -1.064e+00 1.418e+00 -0.750 0.453254 WindGustDirESE 1.998e+00 1.088e+00 1.837 0.066235 . WindGustDirN -2.731e-03 1.759e+00 -0.002 0.998761 WindGustDirNE 1.773e+00 1.260e+00 1.407 0.159364 WindGustDirNNE -1.619e+01 2.884e+03 -0.006 0.995520 WindGustDirNNW 1.859e+00 1.021e+00 1.821 0.068672 . WindGustDirNW 1.011e+00 1.009e+00 1.002 0.316338 WindGustDirS 1.752e+00 1.248e+00 1.404 0.160394 WindGustDirSE -1.549e+01 2.075e+03 -0.007 0.994043 WindGustDirSSE -1.051e-01 1.636e+00 -0.064 0.948777 WindGustDirSSW -1.451e+01 6.523e+03 -0.002 0.998225 WindGustDirSW 2.002e+01 6.523e+03 0.003 0.997551 WindGustDirW 1.108e+00 1.183e+00 0.936 0.349051 WindGustDirWNW 1.325e-01 1.269e+00 0.104 0.916848 WindGustDirWSW -1.574e+01 4.367e+03 -0.004 0.997124 WindGustSpeed 1.674e-02 2.065e-02 0.811 0.417420 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 135.61 on 227 degrees of freedom AIC: 177.61 Number of Fisher Scoring iterations: 17  The WindGustDir and WindGustSpeed predictors are reported as not significative. Also the AIC value is sensibly increased. drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirWNW + WindGustDirWSW + WindGustSpeed
Df Deviance    AIC     LRT  Pr(>Chi)
135.61 177.61
Cloud3pm        1   141.64 181.64  6.0340   0.01403 *
Humidity3pm     1   154.49 194.49 18.8817 1.391e-05 ***
Pressure3pm     1   141.38 181.38  5.7692   0.01631 *
Temp3pm         1   142.21 182.21  6.5992   0.01020 *
WindGustDirENE  1   136.20 176.20  0.5921   0.44159
WindGustDirESE  1   139.40 179.40  3.7883   0.05161 .
WindGustDirN    1   135.61 175.61  0.0000   0.99876
WindGustDirNE   1   137.55 177.55  1.9412   0.16354
WindGustDirNNE  1   136.40 176.40  0.7859   0.37535
WindGustDirNNW  1   139.43 179.43  3.8160   0.05077 .
WindGustDirNW   1   136.69 176.69  1.0855   0.29747
WindGustDirS    1   137.64 177.64  2.0293   0.15430
WindGustDirSE   1   136.36 176.36  0.7458   0.38782
WindGustDirSSE  1   135.61 175.61  0.0041   0.94866
WindGustDirSSW  1   135.64 175.64  0.0339   0.85384
WindGustDirSW   1   138.38 178.38  2.7693   0.09609 .
WindGustDirW    1   136.52 176.52  0.9106   0.33996
WindGustDirWNW  1   135.62 175.62  0.0109   0.91689
WindGustDirWSW  1   135.85 175.85  0.2414   0.62318
WindGustSpeed   1   136.27 176.27  0.6633   0.41541
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



WindGustDir has some borderline p-value for some specific directions. WindGustSpeed is not significative and we should drop it from the model. Hence, we redefine such model after having taken WindGustSpeed off while keeping WindGustDir.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8574256


(summary_rep <- summary(mod_ev_c2_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.45082 -0.40624 -0.21271 -0.04667 2.71193 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 118.76549 42.77047 2.777 0.005490 ** Cloud3pm 0.25566 0.10794 2.369 0.017859 * Humidity3pm 0.08052 0.02126 3.787 0.000152 *** Pressure3pm -0.12701 0.04172 -3.044 0.002333 ** Temp3pm 0.13035 0.05441 2.396 0.016592 * WindGustDirENE -1.13954 1.43581 -0.794 0.427398 WindGustDirESE 2.03313 1.10027 1.848 0.064624 . WindGustDirN -0.00212 1.68548 -0.001 0.998996 WindGustDirNE 1.59121 1.25530 1.268 0.204943 WindGustDirNNE -16.31932 2891.85796 -0.006 0.995497 WindGustDirNNW 1.87617 1.03532 1.812 0.069962 . WindGustDirNW 1.09105 1.01928 1.070 0.284433 WindGustDirS 1.93346 1.24036 1.559 0.119046 WindGustDirSE -15.50996 2073.57766 -0.007 0.994032 WindGustDirSSE -0.09029 1.63401 -0.055 0.955934 WindGustDirSSW -14.34617 6522.63868 -0.002 0.998245 WindGustDirSW 19.99670 6522.63868 0.003 0.997554 WindGustDirW 1.16834 1.18736 0.984 0.325127 WindGustDirWNW 0.16961 1.28341 0.132 0.894858 WindGustDirWSW -15.71816 4349.40572 -0.004 0.997117 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 136.27 on 228 degrees of freedom AIC: 176.27 Number of Fisher Scoring iterations: 17  drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirWNW + WindGustDirWSW
Df Deviance    AIC     LRT  Pr(>Chi)
136.27 176.27
Cloud3pm        1   142.24 180.24  5.9653  0.014590 *
Humidity3pm     1   154.50 192.50 18.2255 1.962e-05 ***
Pressure3pm     1   146.76 184.76 10.4852  0.001203 **
Temp3pm         1   142.33 180.33  6.0611  0.013819 *
WindGustDirENE  1   136.93 174.93  0.6612  0.416126
WindGustDirESE  1   140.13 178.13  3.8568  0.049545 *
WindGustDirN    1   136.27 174.27  0.0000  0.998996
WindGustDirNE   1   137.87 175.87  1.5970  0.206332
WindGustDirNNE  1   137.14 175.14  0.8698  0.351012
WindGustDirNNW  1   140.09 178.09  3.8159  0.050769 .
WindGustDirNW   1   137.52 175.52  1.2477  0.263994
WindGustDirS    1   138.78 176.78  2.5032  0.113617
WindGustDirSE   1   137.03 175.03  0.7541  0.385179
WindGustDirSSE  1   136.28 174.28  0.0031  0.955862
WindGustDirSSW  1   136.30 174.30  0.0290  0.864850
WindGustDirSW   1   139.00 177.00  2.7314  0.098393 .
WindGustDirW    1   137.28 175.28  1.0100  0.314905
WindGustDirWNW  1   136.29 174.29  0.0174  0.894921
WindGustDirWSW  1   136.51 174.51  0.2373  0.626164
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



WindGustDirESE is reported as significant. Hence to include WindGustDir is right and then we accept that model as a candidate one. The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999998



To investigate a final third choice, we gather a small set of predictors, Pressure3pm and Sunshine.

predictors_evening_c3 <- c("Pressure3pm", "Sunshine")
formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~"))
mod_ev_c3_fit <- train(formula_evening_c3,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c3_fit$results$Accuracy

[1] 0.8484513



The training set based accuracy has an acceptable value.

(summary_rep <- summary(mod_ev_c3_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.1123 -0.4602 -0.2961 -0.1562 2.3997 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 192.97146 36.09640 5.346 8.99e-08 *** Pressure3pm -0.18913 0.03545 -5.336 9.52e-08 *** Sunshine -0.35973 0.06025 -5.971 2.36e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 160.03 on 245 degrees of freedom AIC: 166.03 Number of Fisher Scoring iterations: 6  All predictors are reported as significative. drop1(mod_ev_c3_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Sunshine
Df Deviance    AIC    LRT  Pr(>Chi)
160.03 166.03
Pressure3pm  1   199.36 203.36 39.324 3.591e-10 ***
Sunshine     1   206.09 210.09 46.060 1.147e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999938



We compare the last two models by running an ANOVA analysis on those to check if the lower residual deviance of the first model is significative or not.

anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")
Analysis of Deviance Table

Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE +
WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE +
WindGustDirWNW + WindGustDirWSW
Model 2: .outcome ~ Pressure3pm + Sunshine
Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
1       228     136.27
2       245     160.03 -17   -23.76   0.1261



Based on p-value, there is no significative difference between them. We then choose both models. The first model because it provides with a better accuracy than the second. The second model for its simplicity. Let us evaluate the test accuracy for both of them.

modevening_pred <- predict(mod_ev_c2_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])

Confusion Matrix and Statistics

Reference
Prediction No Yes
No  83   9
Yes  3  10

Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408

Kappa : 0.5604
Mcnemar's Test P-Value : 0.14891

Sensitivity : 0.9651
Specificity : 0.5263
Pos Pred Value : 0.9022
Neg Pred Value : 0.7692
Prevalence : 0.8190
Detection Rate : 0.7905
Detection Prevalence : 0.8762
Balanced Accuracy : 0.7457

'Positive' Class : No



Good test accuracy with a high sensitivity and positive predicted values. We then test the second evening forecast candidate model.

modevening_pred <- predict(mod_ev_c3_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

Reference
Prediction No Yes
No  81   7
Yes  5  12

Accuracy : 0.8857
95% CI : (0.8089, 0.9395)
No Information Rate : 0.819
P-Value [Acc > NIR] : 0.04408

Kappa : 0.598
Mcnemar's Test P-Value : 0.77283

Sensitivity : 0.9419
Specificity : 0.6316
Pos Pred Value : 0.9205
Neg Pred Value : 0.7059
Prevalence : 0.8190
Detection Rate : 0.7714
Detection Prevalence : 0.8381
Balanced Accuracy : 0.7867

'Positive' Class : No



Slightly higher accuracy and lower sensitivity than the previous model. Positive predicitive value is improved with respect the same.

Resuming up our final choices:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine



We save the training record indexes, the training and testing datasets together with all the chosen models for further analysis.

saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit, mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit), file="wf_log_reg_part2.rds")


Conclusions

We build models for weather forecasts purposes at different hours of the day. We explored predictors significancy and training accuracy to determine which are the best models. We computed test accuracy and collect results. Prediction accuracy was shown to be quite satisfactory for 3PM and evening models. In next part of this tutorial, we will tune all those models to improve their accuracy if possible.

If you have any questions, please feel free to comment below.