In this second part of the tutorial (the first part is here), we build two types of classification models and compare their performance in terms of accuracy.
Packages
The full list of packages used in this tutorial (part #1 and part #2) is as follows.
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(Kmisc))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(ggparallel))
suppressPackageStartupMessages(library(rpart.plot))
suppressPackageStartupMessages(library(sqldf))
Classification Models
We use the caret package (ref. [8]) to build models with the rpart and C5.0Rules methods. As a first step, we define the training and validation datasets and the model formula. The original dataset is split 60%/40% into a training dataset and a validation dataset. The training procedure uses 10-fold cross-validation.
url_file <- "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
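# stringsAsFactors=TRUE keeps every column a factor (the default before R 4.0),
# so that class is a factor for train() and confusionMatrix()
mushrooms <- read.csv(url(url_file), header=FALSE, stringsAsFactors=TRUE)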
fields <- c("class",
"cap_shape",
"cap_surface",
"cap_color",
"bruises",
"odor",
"gill_attachment",
"gill_spacing",
"gill_size",
"gill_color",
"stalk_shape",
"stalk_root",
"stalk_surface_above_ring",
"stalk_surface_below_ring",
"stalk_color_above_ring",
"stalk_color_below_ring",
"veil_type",
"veil_color",
"ring_number",
"ring_type",
"spore_print_color",
"population",
"habitat")
colnames(mushrooms) <- fields
set.seed(1023)
train_idx <- createDataPartition(mushrooms$class, p=0.6, list=FALSE)
trControl <- trainControl(method = "repeatedcv", number=10, repeats=5, verboseIter=TRUE)
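# relevant_features was defined in part 1; this definition is reconstructed
# from the formula printed below (all fields except class and veil_type)
relevant_features <- setdiff(fields, c("class", "veil_type"))
feat_sum <- paste(relevant_features, collapse = "+")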
frm <- as.formula(paste("class ~ ", feat_sum))
frm
class ~ cap_shape + cap_surface + cap_color + bruises + odor +
gill_attachment + gill_spacing + gill_size + gill_color +
stalk_shape + stalk_root + stalk_surface_above_ring + stalk_surface_below_ring +
stalk_color_above_ring + stalk_color_below_ring + veil_color +
ring_number + ring_type + spore_print_color + population +
habitat
The formula above is used for the models we build next.
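As a rough illustration of what the 60/40 split does, here is a minimal base R sketch of a class-stratified partition. It is not caret's implementation of createDataPartition; the helper name stratified_split and the toy labels are assumptions made only for this example.

```r
# Hypothetical helper (not part of caret): draw a fraction p of the row
# indices separately within each class, so class proportions are preserved.
stratified_split <- function(labels, p) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1023)
# Toy labels standing in for mushrooms$class
toy_class <- factor(rep(c("e", "p"), times = c(520, 480)))
toy_train_idx <- stratified_split(toy_class, p = 0.6)

length(toy_train_idx)              # 600 rows for training
table(toy_class[toy_train_idx])    # 312 'e' and 288 'p'
table(toy_class[-toy_train_idx])   # 208 'e' and 192 'p'
```

Both partitions keep the 52%/48% class balance of the toy labels, which is the property a stratified split is meant to guarantee.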
RPART
We use the rpart classification model (ref. [9]), which builds Recursive Partitioning Tree models. The rpart routine builds classification or regression models of a very general structure using a two-stage procedure; the resulting models can be represented as binary trees. The tree is built by the following process: first, the single variable is found which best splits the data into two groups (‘best’ according to some criterion). The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or no improvement can be made. For further details please see ref. [9].
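To make the "find the single best split" step concrete, here is a minimal base R sketch that scores every "variable equals level" test by Gini impurity and keeps the best one. It is only an illustration under stated assumptions: the toy data is made up, the helper names are hypothetical, and rpart's actual search is more general than this one-level comparison.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity after splitting the rows into in_left / not in_left
split_impurity <- function(y, in_left) {
  n <- length(y)
  n_left <- sum(in_left)
  if (n_left == 0 || n_left == n) return(gini(y))  # not a real split
  (n_left / n) * gini(y[in_left]) + ((n - n_left) / n) * gini(y[!in_left])
}

# Toy data (made up for illustration, not taken from the mushroom dataset)
toy_split <- data.frame(
  odor      = c("n", "n", "n", "f", "f", "n", "f", "n"),
  gill_size = c("b", "n", "b", "n", "b", "b", "n", "n"),
  class     = c("e", "e", "e", "p", "p", "e", "p", "p"),
  stringsAsFactors = FALSE
)

# Try every "variable == level" test and record the resulting impurity
candidates <- do.call(rbind, lapply(c("odor", "gill_size"), function(v) {
  do.call(rbind, lapply(unique(toy_split[[v]]), function(lv) {
    data.frame(variable = v, level = lv,
               impurity = split_impurity(toy_split$class, toy_split[[v]] == lv),
               stringsAsFactors = FALSE)
  }))
}))

best <- candidates[which.min(candidates$impurity), ]
best   # the odor split wins, with weighted impurity 0.2
```

A recursive partitioner would then repeat the same search inside each of the two groups produced by the winning split.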
In the following, we set the complexity parameter threshold, cp, to zero so that adding a split to the classification tree being built carries no cost. This has to be done carefully, as a low cp value may result in overfitting. Accuracy is the metric we take into account.
rpart.grid <- expand.grid(.cp=0)
rpart_fit <- train(frm,
data = mushrooms[train_idx,],
method ="rpart",
trControl = trControl,
tuneGrid=rpart.grid,
metric = 'Accuracy')
rpart_fit
CART
4875 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4387, 4387, 4388, 4388, 4387, 4387, ...
Resampling results:
Accuracy Kappa
0.9975385 0.9950708
Tuning parameter 'cp' was held constant at a value of 0
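For reference, the effect of cp can be read from the cost-complexity measure described in the rpart vignette (ref. [9]). The notation below is our transcription of it and should be checked against the vignette.

```latex
R_{cp}(T) = R(T) + cp \cdot |T| \cdot R(T_1)
```

Here R(T) is the risk of tree T (for classification, its misclassification error), |T| is the number of splits, and T_1 is the tree with no splits. With cp = 0 the penalty term vanishes, so any split that lowers R(T) is kept, which is why overfitting has to be watched for.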
The variable importance report is shown below.
varImp(rpart_fit)
rpart variable importance
only 20 most important variables shown (out of 95)
Overall
odorn 100.000
odorf 63.133
stalk_surface_above_ringk 60.013
stalk_surface_below_ringk 53.376
gill_sizen 48.751
bruisest 38.092
odorl 37.967
stalk_rootc 29.069
ring_typep 28.692
habitatm 18.715
stalk_surface_below_ringy 16.513
stalk_rootr 14.035
spore_print_colorr 4.717
spore_print_coloru 4.123
gill_spacingw 4.047
odorm 2.987
stalk_surface_above_rings 2.987
stalk_surface_below_rings 2.987
stalk_color_below_ringy 2.019
cap_colory 1.892
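The names in the report above, such as odorn or odorf, combine a column name with one of its level codes. The sketch below shows how a formula expands a factor into 0/1 indicator columns with base R's model.matrix; we assume caret's formula interface names its columns the same way, and the toy data frame is made up for the example.

```r
# Toy factor using three of the dataset's odor codes:
# a = almond, f = foul, n = none
odor_toy <- data.frame(odor = factor(c("a", "f", "n", "n", "f")))

# The formula expands the factor into 0/1 indicator columns,
# one per level except the first (reference) level
mm <- model.matrix(~ odor, data = odor_toy)
colnames(mm)    # "(Intercept)" "odorf" "odorn"
mm[, "odorn"]   # 1 for rows with no odor, 0 otherwise
```

This also explains why 21 predictors turn into 95 variables in the report, and why the C5.0 rules later test conditions such as odorn > 0.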
Having no odor or a foul odor turns out to be highly relevant in determining whether a mushroom is edible or poisonous. Stalk surface and gill size are further aspects to consider. We then verify our model against the validation dataset.
rpart_test_pred <- predict(rpart_fit, mushrooms[-train_idx,])
confusionMatrix(rpart_test_pred, mushrooms[-train_idx,]$class)
Confusion Matrix and Statistics
Reference
Prediction e p
e 1683 0
p 0 1566
Accuracy : 1
95% CI : (0.9989, 1)
No Information Rate : 0.518
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.518
Detection Rate : 0.518
Detection Prevalence : 0.518
Balanced Accuracy : 1.000
'Positive' Class : e
100% accuracy is achieved on the validation dataset. The resulting classification tree is plotted as follows.
rpart.plot(rpart_fit$finalModel, cex=0.6)
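The main figures of the confusion matrix report above can be recomputed by hand from the four counts. The following base R sketch does so; the counts are copied from the report, while the variable names are our own, and the interval is computed with binom.test, which we understand to be what confusionMatrix relies on.

```r
# Validation counts copied from the confusion matrix report
cm <- matrix(c(1683, 0,
               0, 1566),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("e", "p"),
                             Reference  = c("e", "p")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n
# No Information Rate: share of the most frequent reference class
nir      <- max(colSums(cm)) / n
# Cohen's kappa: observed agreement corrected for chance agreement
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)
# Exact binomial 95% interval for the accuracy
ci <- binom.test(sum(diag(cm)), n)$conf.int

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa), 3)  # 1, 0.518, 1
round(ci, 4)                                                      # 0.9989 1
```

Note that even with zero errors the lower bound of the interval stays below 1, because it reflects the finite size of the validation set (3249 rows).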
C5.0Rules
Decision trees can sometimes be quite difficult to understand. An important feature of C5.0 is its ability to generate classifiers called rulesets that consist of unordered collections of (relatively) simple if-then rules (refs. [10] and [11]). We reuse the same train control settings and target metric.
c50_fit <- train(frm,
data = mushrooms[train_idx,],
method ="C5.0Rules",
trControl = trControl,
metric = 'Accuracy')
c50_fit
Single C5.0 Ruleset
4875 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4388, 4388, 4388, 4387, 4387, 4387, ...
Resampling results:
Accuracy Kappa
1 1
Variable importance is shown below.
varImp(c50_fit)
C5.0Rules variable importance
only 20 most important variables shown (out of 95)
Overall
odorn 100.00
bruisest 72.92
stalk_rootc 57.34
stalk_rootr 52.84
gill_spacingw 51.58
spore_print_colorr 44.79
gill_sizen 44.69
stalk_surface_below_ringy 20.38
cap_shapex 0.00
veil_colory 0.00
populationy 0.00
ring_typen 0.00
gill_coloro 0.00
cap_colorc 0.00
spore_print_coloru 0.00
habitatu 0.00
stalk_color_below_ringc 0.00
spore_print_colork 0.00
cap_surfaces 0.00
habitatw 0.00
As for the rpart model, having no odor is the most important feature. Bruise and stalk characteristics follow. C5.0Rules relies on fewer important variables than rpart: it was able to focus on a smaller variable set and still achieve the same accuracy, as we will also see later on the validation dataset.
summary(c50_fit)
Rules:
Rule 1: (1926, lift 1.9)
    odorn > 0 gill_sizen <= 0 spore_print_colorr class e [0.999]
Rule 2: (867, lift 1.9)
    bruisest 0 stalk_surface_below_ringy class e [0.999]
Rule 3: (313, lift 1.9)
    bruisest > 0 stalk_rootc > 0 -> class e [0.997]
Rule 4: (116, lift 1.9)
    stalk_rootr > 0 -> class e [0.992]
Rule 5: (61, lift 1.9)
    bruisest > 0 odorn 0 -> class e [0.984]
Rule 6: (2200, lift 2.1)
    odorn <= 0 gill_spacingw <= 0 stalk_rootc <= 0 stalk_rootr class p [1.000]
Rule 7: (1948, lift 2.1)
    bruisest <= 0 odorn class p [0.999]
Rule 8: (37, lift 2.0)
    spore_print_colorr > 0 -> class p [0.974]
Rule 9: (26, lift 2.0)
    gill_sizen > 0 stalk_surface_below_ringy > 0 -> class p [0.964]
Rule 10: (7, lift 1.8)
    bruisest > 0 odorn > 0 gill_sizen > 0 -> class p [0.889]
Default class: e
Evaluation on training data (4875 cases):
        Rules
  ----------------
    No      Errors
    10    0( 0.0%)   <<
   (a)   (b)    <-classified as
  ----  ----
  2525          (a): class e
        2350    (b): class p
Attribute usage:
  89.91% odorn
  65.56% bruisest
  51.55% stalk_rootc
  47.51% stalk_rootr
  46.38% gill_spacingw
  40.27% spore_print_colorr
  40.18% gill_sizen
  18.32% stalk_surface_below_ringy
Each rule consists of:
* a rule number — this is quite arbitrary and serves only to identify the rule.
* statistics (n, lift x) or (n/m, lift x) that summarize the performance of the rule. Similarly to a leaf, n is the number of training cases covered by the rule and m, if it appears, shows how many of them do not belong to the class predicted by the rule. The rule’s accuracy is estimated by the Laplace ratio (n-m+1)/(n+2). The lift x is the result of dividing the rule’s estimated accuracy by the relative frequency of the predicted class in the training set.
* one or more conditions that must all be satisfied if the rule is to be applicable.
* a class predicted by the rule
* a value between 0 and 1 that indicates the confidence with which this prediction is made
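The Laplace ratio and the lift described in the list above can be checked against the summary output. The short base R sketch below recomputes them for Rule 4 and Rule 10, using the class counts reported for the training data; the function and variable names are our own.

```r
# Laplace-corrected accuracy of a rule covering n training cases,
# m of which do not belong to the predicted class
laplace_accuracy <- function(n, m = 0) (n - m + 1) / (n + 2)

# Class frequencies in the training set, from the summary output
n_train <- 4875
freq_e  <- 2525 / n_train
freq_p  <- 2350 / n_train

# Rule 4: (116, lift 1.9) -> class e
conf_rule4 <- laplace_accuracy(116)
round(conf_rule4, 3)            # 0.992
round(conf_rule4 / freq_e, 1)   # 1.9

# Rule 10: (7, lift 1.8) -> class p
conf_rule10 <- laplace_accuracy(7)
round(conf_rule10, 3)           # 0.889
round(conf_rule10 / freq_p, 1)  # 1.8
```

Both rules make no errors on the cases they cover, yet Rule 10 gets a lower confidence because it covers only 7 cases: the Laplace correction penalizes rules supported by little data.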
Now we will evaluate our model against the validation set.
c50_test_pred <- predict(c50_fit, mushrooms[-train_idx,])
confusionMatrix(c50_test_pred, mushrooms[-train_idx,]$class)
Confusion Matrix and Statistics
Reference
Prediction e p
e 1683 0
p 0 1566
Accuracy : 1
95% CI : (0.9989, 1)
No Information Rate : 0.518
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.518
Detection Rate : 0.518
Detection Prevalence : 0.518
Balanced Accuracy : 1.000
'Positive' Class : e
C5.0Rules reaches 100% accuracy on the validation dataset as well.
Comparing Models
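The caret package allows model performances to be compared. The resamples function can be used to collect, summarize and contrast the resampling results. The comparison is paired only if the same folds were used for each model, which requires initializing the random number seed to the same value prior to each call to train (ref. [12]). In the code above set.seed is called only once, before createDataPartition, and the differing fold sizes in the two training reports show that the two models were resampled on different folds.
results <- resamples(list(RPART=rpart_fit, C5.0Rules=c50_fit))
bwplot(results)
The C5.0Rules model turns out to be slightly more reliable than the rpart-based one, as it shows a higher minimum accuracy across the resamples.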
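To see why the seed matters when two models are to be compared on the same folds, here is a minimal base R sketch. The helper assign_folds is hypothetical and much simpler than caret's own fold creation (which, for a factor outcome, is stratified); it only illustrates how the seed controls the assignment.

```r
# Hypothetical helper: assign n rows to k folds at random.
assign_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

n_rows  <- 4875   # size of the training dataset above
folds_a <- assign_folds(n_rows, k = 10, seed = 1023)
folds_b <- assign_folds(n_rows, k = 10, seed = 1023)
folds_c <- assign_folds(n_rows, k = 10, seed = 2047)

identical(folds_a, folds_b)   # TRUE: same seed, same folds
identical(folds_a, folds_c)   # a different seed gives a different assignment

# Training-set size for each fold = rows outside that fold:
# 4387 or 4388, the same sizes that appear in the caret reports
n_rows - table(folds_a)
```

In a real run the equivalent step would be to call set.seed with the same value immediately before each call to train.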
Further Considerations
What if only features related to the visual characteristics of mushrooms were available? We recompute our models with a subset of the available features, keeping only those that are immediately visually observable. To this end, we define a new set of features and the corresponding model formula.
relevant_features_2 <- setdiff(relevant_features,
c("population", "habitat", "odor", "bruises"))
feat_sum <- paste(relevant_features_2, collapse = "+")
frm <- as.formula(paste("class ~ ", feat_sum))
frm
class ~ cap_shape + cap_surface + cap_color + gill_attachment +
gill_spacing + gill_size + gill_color + stalk_shape + stalk_root +
stalk_surface_above_ring + stalk_surface_below_ring + stalk_color_above_ring +
stalk_color_below_ring + veil_color + ring_number + ring_type +
spore_print_color
With rpart:
rpart.grid <- expand.grid(.cp=0)
rpart_fit <- train(frm,
data = mushrooms[train_idx,],
method ="rpart",
trControl = trControl,
tuneGrid=rpart.grid,
metric = 'Accuracy')
rpart_fit
CART
4875 samples
17 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4388, 4388, 4387, 4388, 4387, 4387, ...
Resampling results:
Accuracy Kappa
0.9991803 0.9983589
Tuning parameter 'cp' was held constant at a value of 0
varImp(rpart_fit)
rpart variable importance
only 20 most important variables shown (out of 75)
Overall
gill_sizen 100.000
stalk_surface_below_ringk 66.448
ring_typep 61.468
stalk_surface_above_ringk 57.196
stalk_surface_above_rings 38.960
spore_print_colorh 35.409
spore_print_colorw 25.831
gill_spacingw 14.703
cap_surfaces 13.711
ring_numbert 11.714
ring_numbero 9.179
spore_print_colorr 9.136
cap_colorw 6.227
ring_typef 5.981
stalk_shapet 5.937
spore_print_coloru 4.092
stalk_rootb 3.868
cap_colorg 3.504
stalk_roote 3.451
stalk_color_below_ringn 2.548
rpart_test_pred <- predict(rpart_fit, mushrooms[-train_idx,])
confusionMatrix(rpart_test_pred, mushrooms[-train_idx,]$class)
Confusion Matrix and Statistics
Reference
Prediction e p
e 1683 0
p 0 1566
Accuracy : 1
95% CI : (0.9989, 1)
No Information Rate : 0.518
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.518
Detection Rate : 0.518
Detection Prevalence : 0.518
Balanced Accuracy : 1.000
'Positive' Class : e
rpart.plot(rpart_fit$finalModel, cex=0.6)
Now with C5.0Rules.
c50_fit <- train(frm,
data = mushrooms[train_idx,],
method ="C5.0Rules",
trControl = trControl,
metric = 'Accuracy')
c50_fit
Single C5.0 Ruleset
4875 samples
17 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 4388, 4387, 4388, 4387, 4388, 4388, ...
Resampling results:
Accuracy Kappa
1 1
varImp(c50_fit)
C5.0Rules variable importance
only 20 most important variables shown (out of 75)
Overall
gill_sizen 100.00
stalk_surface_above_ringk 83.17
spore_print_colorh 70.47
spore_print_colorr 49.81
spore_print_colorw 44.94
gill_spacingw 31.60
stalk_rootb 27.40
stalk_color_below_ringn 26.02
stalk_shapet 23.25
cap_surfacey 21.25
ring_typef 14.68
cap_surfaces 10.13
gill_colorh 0.00
ring_numbert 0.00
cap_shapek 0.00
gill_colory 0.00
stalk_color_below_ringw 0.00
gill_colorr 0.00
veil_coloro 0.00
ring_typen 0.00
summary(c50_fit)
Call:
Rules:
Rule 1: (2267, lift 1.9)
    gill_sizen <= 0 stalk_surface_above_ringk <= 0 spore_print_colorh <= 0 spore_print_colorr class e [1.000]
Rule 2: (328, lift 1.9)
    cap_surfaces <= 0 cap_surfacey <= 0 stalk_rootb <= 0 stalk_surface_above_ringk class e [0.997]
Rule 3: (88, lift 1.9)
    gill_spacingw > 0 stalk_surface_above_ringk > 0 -> class e [0.989]
Rule 4: (61, lift 1.9)
    gill_sizen > 0 stalk_shapet > 0 spore_print_colorw class e [0.984]
Rule 5: (34, lift 1.9)
    stalk_surface_above_ringk 0 -> class e [0.972]
Rule 6: (29, lift 1.9)
    ring_typef > 0 -> class e [0.968]
Rule 7: (919, lift 2.1)
    stalk_shapet 0 spore_print_colorw class p [0.999]
Rule 8: (940, lift 2.1)
    gill_sizen 0 -> class p [0.999]
Rule 9: (1065, lift 2.1)
    gill_sizen > 0 stalk_color_below_ringn 0 -> class p [0.999]
Rule 10: (1350, lift 2.1)
    gill_spacingw 0 -> class p [0.999]
Rule 11: (639, lift 2.1)
    cap_surfacey > 0 gill_sizen > 0 stalk_color_below_ringn <= 0 ring_typef class p [0.998]
Rule 12: (133, lift 2.1)
    cap_surfaces > 0 gill_sizen > 0 stalk_shapet class p [0.993]
Default class: e
Evaluation on training data (4875 cases):
        Rules
  ----------------
    No      Errors
    12    0( 0.0%)   <<
   (a)   (b)    <-classified as
  ----  ----
  2525          (a): class e
        2350    (b): class p
Attribute usage:
  93.35% gill_sizen
  77.64% stalk_surface_above_ringk
  65.78% spore_print_colorh
  46.50% spore_print_colorr
  41.95% spore_print_colorw
  29.50% gill_spacingw
  25.58% stalk_rootb
  24.29% stalk_color_below_ringn
  21.70% stalk_shapet
  19.84% cap_surfacey
  13.70% ring_typef
   9.46% cap_surfaces
c50_test_pred <- predict(c50_fit, mushrooms[-train_idx,])
confusionMatrix(c50_test_pred, mushrooms[-train_idx,]$class)
Confusion Matrix and Statistics
Reference
Prediction e p
e 1683 0
p 0 1566
Accuracy : 1
95% CI : (0.9989, 1)
No Information Rate : 0.518
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.518
Detection Rate : 0.518
Detection Prevalence : 0.518
Balanced Accuracy : 1.000
'Positive' Class : e
Both rpart and C5.0Rules achieve 100% accuracy again, with a limited increase in the complexity of the resulting models compared with those built on the former feature set. Let us compare the resulting rpart and C5.0Rules models again.
results <- resamples(list(RPART=rpart_fit, C5.0Rules=c50_fit))
bwplot(results)
Conclusions
Both rpart and C5.0Rules were able to achieve very high accuracy. We compared them using two feature sets. If you have any questions, please feel free to comment below.
References
[1] UCI Machine Learning Archive – Mushroom Dataset
[2] Wikipedia – Mushroom Tutorial
[3] Mushroom Anatomy
[4] Identify Mushrooms
[5] Wikipedia – Universal Veil
[6] Wikipedia – Partial Veil
[7] Mushroom Glossary
[8] Caret package site
[9] rpart package vignette
[10] C5.0 package vignette
[11] C5.0: An Informal Tutorial
[12] Caret Package Vignette
[13] How Decision Tree Algorithms Works



