Weather forecast with regression models – part 2

In the first part, we introduced the weather dataset and outlined its exploratory analysis. In the second part of our tutorial, we are going to build multiple logistic regression models to predict weather forecast. Specifically, we intend to produce the following forecasts:

tomorrow’s weather forecast at 9am of the current day
tomorrow’s weather forecast at 3pm of the current day
tomorrow’s weather forecast at late evening time of the current day

For each of above tasks, a specific subset of variables shall be available, and precisely:

9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed

We suppose the MinTemp already available at 9am as we expect the overnight temperature resulting with that information. We suppose the MaxTemp already available at 3pm, as determined on central day hours. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed final information only by late evening. Other variables are explicitely bound to a specific daytime.

suppressPackageStartupMessages(library(caret))
set.seed(1023)
weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(weather_data5)
[1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"  
 [6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"  
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"      
[16] "Temp3pm"       "RainTomorrow"

nrow(weather_data5)
[1] 353

sum(weather_data5["RainTomorrow"] == "Yes")
[1] 64

sum(weather_data5["RainTomorrow"] == "No")
[1] 289

We are going to split the original dataset in a training dataset (70% of original data) and a testing dataset (30% remaining).

train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE)
training <- weather_data5[train_rec,]
testing <- weather_data5[-train_rec,]

We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets.

sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No")
[1] 0.2216749

sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No")
[1] 0.2209302

The dataset is slightly unbalanced with respect the label to be predicted. We will check later if some remedy is needed or achieved accuracy is satisfactory as well.

9AM Forecast Model

For all models, we are going to take advantage of a train control directive made available by the caret package which prescribes repeated k-fold cross-validation. The k-fold cross validation method involves splitting the training dataset into k-subsets. For each subset, one is held out while the model is trained on all other subsets. This process is completed until accuracy is determined for each instance in the training dataset, and an overall accuracy estimate is provided. The process of splitting the data into k-folds can be repeated a number of times and this is called repeated k-fold cross validation. The final model accuracy is taken as the mean from the number of repeats.

trControl <- trainControl(method = "repeatedcv",  repeats = 5, number = 10, verboseIter = FALSE)

The trainControl is passed as a parameter to the train() caret function. As a further input to the train() function, the tuneLength parameter indicates the number of different values to try for each algorithm parameter. However in case of the “glm” method, it does not drive any specific behavior and hence it will be omitted.

By taking into account the results from exploratory analysis, we herein compare two models for 9AM prediction. The first one is so determined and evaluated.

predictors_9am_c1 <- c("Cloud9am",  "Humidity9am", "Pressure9am", "Temp9am")
formula_9am_c1 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c1, collapse="+"), sep="~"))
mod9am_c1_fit <- train(formula_9am_c1,  data=training, method="glm", 
                       family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c1_fit$results$Accuracy
[1] 0.8383538

The resulting accuracy shown above is the one determined by the repeated k-fold cross validation as above explained. The obtained value is not that bad considering the use of just 9AM available data.

(summary_rep <- summary(mod9am_c1_fit$finalModel))
Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7667  -0.5437  -0.3384  -0.1874   2.7742  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 96.01611   33.57702   2.860 0.004242 ** 
Cloud9am     0.17180    0.07763   2.213 0.026893 *  
Humidity9am  0.06769    0.02043   3.313 0.000922 ***
Pressure9am -0.10286    0.03283  -3.133 0.001729 ** 
Temp9am      0.10595    0.04545   2.331 0.019744 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 177.34  on 243  degrees of freedom
AIC: 187.34

Number of Fisher Scoring iterations: 5

From above summary, all predictors result as significative for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9994587

We further investigate if any predictor can be dropped by taking advantage of the drop1() function.

drop1(mod9am_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
            Df Deviance    AIC     LRT  Pr(>Chi)    
           177.34 187.34                      
Cloud9am     1   182.48 190.48  5.1414 0.0233623 *  
Humidity9am  1   190.19 198.19 12.8502 0.0003374 ***
Pressure9am  1   187.60 195.60 10.2668 0.0013544 ** 
Temp9am      1   183.13 191.13  5.7914 0.0161050 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can evaluate a second model where the MinTemp is replaced by the Temp9am. We do not keep both as they are correlated (see part #1 exploratory analysis).

predictors_9am_c2 <- c("Cloud9am",  "Humidity9am", "Pressure9am", "MinTemp")
formula_9am_c2 <- as.formula(paste("RainTomorrow", paste(predictors_9am_c2, collapse="+"), sep="~"))
mod9am_c2_fit <- train(formula_9am_c2,  data=training, method="glm", 
                       family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c2_fit$results$Accuracy
[1] 0.8366462

(summary_rep <- summary(mod9am_c2_fit$finalModel))
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7233  -0.5446  -0.3510  -0.1994   2.7237  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) 103.00407   33.57102   3.068  0.00215 **
Cloud9am      0.16379    0.08014   2.044  0.04096 * 
Humidity9am   0.05462    0.01855   2.945  0.00323 **
Pressure9am  -0.10792    0.03298  -3.272  0.00107 **
MinTemp       0.06750    0.04091   1.650  0.09896 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.9  on 247  degrees of freedom
Residual deviance: 180.3  on 243  degrees of freedom
AIC: 190.3

Number of Fisher Scoring iterations: 5

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9990409

This second model shows similar accuracy values, however MinTemp p-value shows that such predictor is not significative. Further, the explained variance is slightly less than the first model one. As a conclusion, we select the first model for 9AM predictions purpose and we go on by calculating accuracy with the test set.

mod9am_pred <- predict(mod9am_c1_fit, testing)
confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  80  12
       Yes  6   7
                                          
               Accuracy : 0.8286          
                 95% CI : (0.7427, 0.8951)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.4602          
                                          
                  Kappa : 0.3405          
 Mcnemar's Test P-Value : 0.2386          
                                          
            Sensitivity : 0.9302          
            Specificity : 0.3684          
         Pos Pred Value : 0.8696          
         Neg Pred Value : 0.5385          
             Prevalence : 0.8190          
         Detection Rate : 0.7619          
   Detection Prevalence : 0.8762          
      Balanced Accuracy : 0.6493          
                                          
       'Positive' Class : No

The confusion matrix shows encouraging results, not a bad test accuracy for a 9AM weather forecast. We then go on building the 3PM prediction model.

3PM Forecast Model

We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated each other and they are not linearly dependent.

predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm")
formula_3pm_c1 <- as.formula(paste("RainTomorrow", paste(predictors_3pm_c1, collapse="+"), sep="~"))
mod3pm_c1_fit <- train(formula_3pm_c1,  data = training, method = "glm", family = "binomial",
                       trControl = trControl, metric = 'Accuracy')
mod3pm_c1$_fitresults$Accuracy
[1] 0.8582333

This model shows an acceptable accuracy as measured on the training set.

(summary_rep <- summary(mod3pm_c1_fit$finalModel))
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3065  -0.5114  -0.2792  -0.1340   2.6641  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 122.56229   35.20777   3.481 0.000499 ***
Cloud3pm      0.27754    0.09866   2.813 0.004906 ** 
Humidity3pm   0.06745    0.01817   3.711 0.000206 ***
Pressure3pm  -0.12885    0.03443  -3.743 0.000182 ***
Temp3pm       0.10877    0.04560   2.385 0.017071 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 159.64  on 243  degrees of freedom
AIC: 169.64

Number of Fisher Scoring iterations: 6

All predictors are reported as significative for the model. Let us further verify if any predictor can be dropped.

drop1(mod3pm_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
            Df Deviance    AIC     LRT  Pr(>Chi)    
           159.64 169.64                      
Cloud3pm     1   168.23 176.23  8.5915  0.003377 ** 
Humidity3pm  1   175.91 183.91 16.2675 5.500e-05 ***
Pressure3pm  1   175.60 183.60 15.9551 6.486e-05 ***
Temp3pm      1   165.77 173.77  6.1329  0.013269 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999913

We go on with the computation of the test set accuracy.

mod3pm_pred <- predict(mod3pm_c1_fit, testing)
confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.58            
 Mcnemar's Test P-Value : 0.38648         
                                          
            Sensitivity : 0.9535          
            Specificity : 0.5789          
         Pos Pred Value : 0.9111          
         Neg Pred Value : 0.7333          
             Prevalence : 0.8190          
         Detection Rate : 0.7810          
   Detection Prevalence : 0.8571          
      Balanced Accuracy : 0.7662          
                                          
       'Positive' Class : No

The test set prediction accuracy is quite satisfactory.

Evening Forecast Model

As first evening forecast model, we introduce the Sunshine variable and at the same time we take Cloud3pm and Humidity3pm out as strongly correlated to Sunshine.

predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine")
formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~"))
mod_ev_c1_fit <- train(formula_evening_c1,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c1_fit$results$Accuracy
[1] 0.8466974

The training set based accuracy is acceptable.

(summary_rep <- summary(mod_ev_c1_fit$finalModel))
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9404  -0.4532  -0.2876  -0.1350   2.4664  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 174.43458   37.12201   4.699 2.61e-06 ***
Pressure3pm  -0.17175    0.03635  -4.726 2.29e-06 ***
Temp3pm       0.06506    0.03763   1.729   0.0838 .  
Sunshine     -0.41519    0.07036  -5.901 3.61e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 156.93  on 244  degrees of freedom
AIC: 164.93

Number of Fisher Scoring iterations: 6

The Temp3pm predictor is reported as not significative and can be dropped as confirmed below.

drop1(mod_ev_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Temp3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)    
           156.93 164.93                     
Pressure3pm  1   185.64 191.64 28.712 8.400e-08 ***
Temp3pm      1   160.03 166.03  3.105   0.07805 .  
Sunshine     1   203.03 209.03 46.098 1.125e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999968

As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8681179

It results with an improved training set accuracy.

(summary_rep <- summary(mod_ev_c2_fit$finalModel))
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.44962  -0.42289  -0.20929  -0.04328   2.76204  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     9.928e+01  4.863e+01   2.042 0.041193 *  
Cloud3pm        2.569e-01  1.081e-01   2.376 0.017481 *  
Humidity3pm     8.441e-02  2.208e-02   3.822 0.000132 ***
Pressure3pm    -1.088e-01  4.695e-02  -2.318 0.020462 *  
Temp3pm         1.382e-01  5.540e-02   2.495 0.012596 *  
WindGustDirENE -1.064e+00  1.418e+00  -0.750 0.453254    
WindGustDirESE  1.998e+00  1.088e+00   1.837 0.066235 .  
WindGustDirN   -2.731e-03  1.759e+00  -0.002 0.998761    
WindGustDirNE   1.773e+00  1.260e+00   1.407 0.159364    
WindGustDirNNE -1.619e+01  2.884e+03  -0.006 0.995520    
WindGustDirNNW  1.859e+00  1.021e+00   1.821 0.068672 .  
WindGustDirNW   1.011e+00  1.009e+00   1.002 0.316338    
WindGustDirS    1.752e+00  1.248e+00   1.404 0.160394    
WindGustDirSE  -1.549e+01  2.075e+03  -0.007 0.994043    
WindGustDirSSE -1.051e-01  1.636e+00  -0.064 0.948777    
WindGustDirSSW -1.451e+01  6.523e+03  -0.002 0.998225    
WindGustDirSW   2.002e+01  6.523e+03   0.003 0.997551    
WindGustDirW    1.108e+00  1.183e+00   0.936 0.349051    
WindGustDirWNW  1.325e-01  1.269e+00   0.104 0.916848    
WindGustDirWSW -1.574e+01  4.367e+03  -0.004 0.997124    
WindGustSpeed   1.674e-02  2.065e-02   0.811 0.417420    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 135.61  on 227  degrees of freedom
AIC: 177.61

Number of Fisher Scoring iterations: 17

The WindGustDir and WindGustSpeed predictors are reported as not significative. Also the AIC value is sensibly increased.

drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW + WindGustSpeed
               Df Deviance    AIC     LRT  Pr(>Chi)    
              135.61 177.61                      
Cloud3pm        1   141.64 181.64  6.0340   0.01403 *  
Humidity3pm     1   154.49 194.49 18.8817 1.391e-05 ***
Pressure3pm     1   141.38 181.38  5.7692   0.01631 *  
Temp3pm         1   142.21 182.21  6.5992   0.01020 *  
WindGustDirENE  1   136.20 176.20  0.5921   0.44159    
WindGustDirESE  1   139.40 179.40  3.7883   0.05161 .  
WindGustDirN    1   135.61 175.61  0.0000   0.99876    
WindGustDirNE   1   137.55 177.55  1.9412   0.16354    
WindGustDirNNE  1   136.40 176.40  0.7859   0.37535    
WindGustDirNNW  1   139.43 179.43  3.8160   0.05077 .  
WindGustDirNW   1   136.69 176.69  1.0855   0.29747    
WindGustDirS    1   137.64 177.64  2.0293   0.15430    
WindGustDirSE   1   136.36 176.36  0.7458   0.38782    
WindGustDirSSE  1   135.61 175.61  0.0041   0.94866    
WindGustDirSSW  1   135.64 175.64  0.0339   0.85384    
WindGustDirSW   1   138.38 178.38  2.7693   0.09609 .  
WindGustDirW    1   136.52 176.52  0.9106   0.33996    
WindGustDirWNW  1   135.62 175.62  0.0109   0.91689    
WindGustDirWSW  1   135.85 175.85  0.2414   0.62318    
WindGustSpeed   1   136.27 176.27  0.6633   0.41541    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDir has some borderline p-value for some specific directions. WindGustSpeed is not significative and we should drop it from the model. Hence, we redefine such model after having taken WindGustSpeed off while keeping WindGustDir.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8574256

(summary_rep <- summary(mod_ev_c2_fit$finalModel))
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.45082  -0.40624  -0.21271  -0.04667   2.71193  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     118.76549   42.77047   2.777 0.005490 ** 
Cloud3pm          0.25566    0.10794   2.369 0.017859 *  
Humidity3pm       0.08052    0.02126   3.787 0.000152 ***
Pressure3pm      -0.12701    0.04172  -3.044 0.002333 ** 
Temp3pm           0.13035    0.05441   2.396 0.016592 *  
WindGustDirENE   -1.13954    1.43581  -0.794 0.427398    
WindGustDirESE    2.03313    1.10027   1.848 0.064624 .  
WindGustDirN     -0.00212    1.68548  -0.001 0.998996    
WindGustDirNE     1.59121    1.25530   1.268 0.204943    
WindGustDirNNE  -16.31932 2891.85796  -0.006 0.995497    
WindGustDirNNW    1.87617    1.03532   1.812 0.069962 .  
WindGustDirNW     1.09105    1.01928   1.070 0.284433    
WindGustDirS      1.93346    1.24036   1.559 0.119046    
WindGustDirSE   -15.50996 2073.57766  -0.007 0.994032    
WindGustDirSSE   -0.09029    1.63401  -0.055 0.955934    
WindGustDirSSW  -14.34617 6522.63868  -0.002 0.998245    
WindGustDirSW    19.99670 6522.63868   0.003 0.997554    
WindGustDirW      1.16834    1.18736   0.984 0.325127    
WindGustDirWNW    0.16961    1.28341   0.132 0.894858    
WindGustDirWSW  -15.71816 4349.40572  -0.004 0.997117    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 136.27  on 228  degrees of freedom
AIC: 176.27

Number of Fisher Scoring iterations: 17

drop1(mod_ev_c2_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW
               Df Deviance    AIC     LRT  Pr(>Chi)    
              136.27 176.27                      
Cloud3pm        1   142.24 180.24  5.9653  0.014590 *  
Humidity3pm     1   154.50 192.50 18.2255 1.962e-05 ***
Pressure3pm     1   146.76 184.76 10.4852  0.001203 ** 
Temp3pm         1   142.33 180.33  6.0611  0.013819 *  
WindGustDirENE  1   136.93 174.93  0.6612  0.416126    
WindGustDirESE  1   140.13 178.13  3.8568  0.049545 *  
WindGustDirN    1   136.27 174.27  0.0000  0.998996    
WindGustDirNE   1   137.87 175.87  1.5970  0.206332    
WindGustDirNNE  1   137.14 175.14  0.8698  0.351012    
WindGustDirNNW  1   140.09 178.09  3.8159  0.050769 .  
WindGustDirNW   1   137.52 175.52  1.2477  0.263994    
WindGustDirS    1   138.78 176.78  2.5032  0.113617    
WindGustDirSE   1   137.03 175.03  0.7541  0.385179    
WindGustDirSSE  1   136.28 174.28  0.0031  0.955862    
WindGustDirSSW  1   136.30 174.30  0.0290  0.864850    
WindGustDirSW   1   139.00 177.00  2.7314  0.098393 .  
WindGustDirW    1   137.28 175.28  1.0100  0.314905    
WindGustDirWNW  1   136.29 174.29  0.0174  0.894921    
WindGustDirWSW  1   136.51 174.51  0.2373  0.626164    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDirESE is reported as significant. Hence to include WindGustDir is right and then we accept that model as a candidate one. The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999998

To investigate a final third choice, we gather a small set of predictors, Pressure3pm and Sunshine.

predictors_evening_c3 <- c("Pressure3pm", "Sunshine")
formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~"))
mod_ev_c3_fit <- train(formula_evening_c3,  data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy')
mod_ev_c3_fit$results$Accuracy

[1] 0.8484513

The training set based accuracy has an acceptable value.

(summary_rep <- summary(mod_ev_c3_fit$finalModel))
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1123  -0.4602  -0.2961  -0.1562   2.3997  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 192.97146   36.09640   5.346 8.99e-08 ***
Pressure3pm  -0.18913    0.03545  -5.336 9.52e-08 ***
Sunshine     -0.35973    0.06025  -5.971 2.36e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 160.03  on 245  degrees of freedom
AIC: 166.03

Number of Fisher Scoring iterations: 6

All predictors are reported as significative.

drop1(mod_ev_c3_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Pressure3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)    
           160.03 166.03                     
Pressure3pm  1   199.36 203.36 39.324 3.591e-10 ***
Sunshine     1   206.09 210.09 46.060 1.147e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9999938

We compare the last two models by running an ANOVA analysis on those to check if the lower residual deviance of the first model is significative or not.

anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")
Analysis of Deviance Table

Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + 
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + 
    WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + 
    WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + 
    WindGustDirWNW + WindGustDirWSW
Model 2: .outcome ~ Pressure3pm + Sunshine
  Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
1       228     136.27                      
2       245     160.03 -17   -23.76   0.1261

Based on p-value, there is no significative difference between them. We then choose both models. The first model because it provides with a better accuracy than the second. The second model for its simplicity. Let us evaluate the test accuracy for both of them.

modevening_pred <- predict(mod_ev_c2_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  83   9
       Yes  3  10
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.5604          
 Mcnemar's Test P-Value : 0.14891         
                                          
            Sensitivity : 0.9651          
            Specificity : 0.5263          
         Pos Pred Value : 0.9022          
         Neg Pred Value : 0.7692          
             Prevalence : 0.8190          
         Detection Rate : 0.7905          
   Detection Prevalence : 0.8762          
      Balanced Accuracy : 0.7457          
                                          
       'Positive' Class : No

Good test accuracy with a high sensitivity and positive predicted values. We then test the second evening forecast candidate model.

modevening_pred <- predict(mod_ev_c3_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  81   7
       Yes  5  12
                                          
               Accuracy : 0.8857          
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819           
    P-Value [Acc > NIR] : 0.04408         
                                          
                  Kappa : 0.598           
 Mcnemar's Test P-Value : 0.77283         
                                          
            Sensitivity : 0.9419          
            Specificity : 0.6316          
         Pos Pred Value : 0.9205          
         Neg Pred Value : 0.7059          
             Prevalence : 0.8190          
         Detection Rate : 0.7714          
   Detection Prevalence : 0.8381          
      Balanced Accuracy : 0.7867          
                                          
       'Positive' Class : No

Slightly higher accuracy and lower sensitivity than the previous model. Positive predicitive value is improved with respect the same.

Resuming up our final choices:

mod9am_c1_fit: RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit: RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit: RainTomorrow ~ Pressure3pm + Sunshine

We save the training record indexes, the training and testing datasets together with all the chosen models for further analysis.

saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit, mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit), file="wf_log_reg_part2.rds")

Conclusions

We build models for weather forecasts purposes at different hours of the day. We explored predictors significancy and training accuracy to determine which are the best models. We computed test accuracy and collect results. Prediction accuracy was shown to be quite satisfactory for 3PM and evening models. In next part of this tutorial, we will tune all those models to improve their accuracy if possible.

If you have any questions, please feel free to comment below.

caretLogistic Regression