1 Introduction

DALEX is designed to work with various black-box models, such as tree ensembles, linear models, and neural networks. Unfortunately, R packages that create such models are very inconsistent: different tools use different interfaces to train, validate, and use models.

In this vignette we will show explanations for models from the caret package (Jed Wing et al. 2016).

2 Regression use case - apartments data

library(DALEX)
library(caret)

To illustrate applications of DALEX to regression problems, we will use the artificial dataset apartments available in the DALEX package. Our goal is to predict the price per square meter of an apartment based on selected features: construction year, surface, floor, number of rooms, and district. Note that four of these variables are continuous, while the fifth (district) is categorical. Prices are given in euros.

data(apartments)
head(apartments)
##   m2.price construction.year surface floor no.rooms    district
## 1     5897              1953      25     3        1 Srodmiescie
## 2     1818              1992     143     9        5     Bielany
## 3     3643              1937      56     1        2       Praga
## 4     3517              1995      93     7        3      Ochota
## 5     3013              1992     144     6        5     Mokotow
## 6     5795              1926      61     6        2 Srodmiescie

2.1 The explain() function

The first step of using the DALEX package is to wrap the black-box model with meta-data that unifies model interfacing.

Below, we use the caret function train() to fit three models: a random forest, a gradient boosting machine, and a neural network.

set.seed(123)
regr_rf <- train(m2.price~., data = apartments, method = "rf", ntree = 100)

regr_gbm <- train(m2.price~. , data = apartments, method = "gbm")

regr_nn <- train(m2.price~., data = apartments,
                   method = "nnet",
                   linout = TRUE,
                   preProcess = c('center', 'scale'),
                   maxit = 500,
                   tuneGrid = expand.grid(size = 2, decay = 0),
                   trControl = trainControl(method = "none", seeds = 1))

To create an explainer for these models, it is enough to call the explain() function with the model, data, and y arguments. The validation dataset for the models is the apartmentsTest data from the DALEX package.

data(apartmentsTest)

explainer_regr_rf <- DALEX::explain(regr_rf, label = "rf", 
                                    data = apartmentsTest, y = apartmentsTest$m2.price, 
                                    verbose = FALSE)

explainer_regr_gbm <- DALEX::explain(regr_gbm, label = "gbm", 
                                     data = apartmentsTest, y = apartmentsTest$m2.price,
                                     verbose = FALSE)

explainer_regr_nn <- DALEX::explain(regr_nn, label = "nn", 
                                    data = apartmentsTest, y = apartmentsTest$m2.price,
                                    verbose = FALSE)

2.2 Model performance

The function model_performance() calculates predictions and residuals for the validation dataset.

mp_regr_rf <- model_performance(explainer_regr_rf)
mp_regr_gbm <- model_performance(explainer_regr_gbm)
mp_regr_nn <- model_performance(explainer_regr_nn)

The generic print() function reports performance measures (MSE, RMSE, R², MAD) and quantiles of the residuals.

mp_regr_rf
## Measures for:  regression
## mse        : 24718.38 
## rmse       : 157.2208 
## r2         : 0.9695138 
## mad        : 59.09508
## 
## Residuals:
##         0%        10%        20%        30%        40%        50%        60% 
## -675.69367 -138.01935  -80.66407  -49.04287  -25.68403   -7.54125   10.41347 
##        70%        80%        90%       100% 
##   34.76663   81.12960  209.95295  754.69767

The generic plot() function shows the reversed empirical cumulative distribution function (ECDF) of the absolute residuals. Plots can be generated for one or more models.
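The reversed ECDF can be sketched in a few lines of base R. This is a simplified illustration using simulated residuals as a stand-in, not DALEX's internals:

```r
# A base-R sketch of the reversed ECDF of absolute residuals:
# for each threshold x, the fraction of |residuals| larger than x.
set.seed(1)
residuals <- rnorm(1000, sd = 150)  # simulated stand-in residuals

abs_res <- sort(abs(residuals))
fraction_larger <- 1 - seq_along(abs_res) / length(abs_res)

plot(abs_res, fraction_larger, type = "s",
     xlab = "|residual|", ylab = "fraction of residuals above threshold")
```

A curve that drops quickly means most residuals are small, which is why this display makes it easy to compare several models at once.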

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm)

The figure above shows that the majority of residuals for the random forest and gbm models are smaller than the residuals for the neural network.

We can also use the plot() function to get an alternative comparison of residuals. By setting geom = "boxplot" we can compare the distribution of residuals for the selected models.

plot(mp_regr_rf, mp_regr_nn, mp_regr_gbm, geom = "boxplot")

2.3 Variable importance

Using the DALEX package, we can better understand which variables are important.

Model-agnostic variable importance is calculated by means of permutations: we compare the loss function calculated on the validation dataset after permuting the values of a single variable with the loss calculated on the original data. The larger the increase in loss, the more important the variable.
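This idea can be sketched in base R. The following is a simplified illustration using lm() and the built-in mtcars data, not DALEX's actual implementation:

```r
# A base-R sketch of permutation-based variable importance.
set.seed(123)
model <- lm(mpg ~ ., data = mtcars)

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Loss of the full model on the data
loss_full <- rmse(mtcars$mpg, predict(model, mtcars))

# Increase in loss after permuting a single variable
permutation_importance <- function(variable) {
  permuted <- mtcars
  permuted[[variable]] <- sample(permuted[[variable]])
  rmse(mtcars$mpg, predict(model, permuted)) - loss_full
}

vars <- setdiff(names(mtcars), "mpg")
importances <- sort(sapply(vars, permutation_importance), decreasing = TRUE)
importances
```

Permuting a variable breaks its relation with the response while keeping its marginal distribution, so the resulting increase in loss reflects how much the model relies on it.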

This method is implemented in the model_parts() function.

vi_regr_rf <- model_parts(explainer_regr_rf, loss_function = loss_root_mean_square)
vi_regr_gbm <- model_parts(explainer_regr_gbm, loss_function = loss_root_mean_square)
vi_regr_nn <- model_parts(explainer_regr_nn, loss_function = loss_root_mean_square)

We can compare all models using the generic plot() function.

plot(vi_regr_rf, vi_regr_gbm, vi_regr_nn)

The left edges of the intervals start at the loss of the full model. As we can see, the performance of the random forest and gbm models is similar, while the neural network performs worse.

The length of an interval corresponds to the variable's importance: a longer interval means a larger loss after permutation, so the variable is more important. For the random forest and gbm models the rankings of the important variables are the same.

2.4 Model profile

The explainers presented in this section are designed to better understand the relation between a variable and the model output.

For more details on the methods described in this section, see the chapters on Partial-dependence Profiles and on Local-dependence and Accumulated-local Profiles.

2.4.1 Partial Dependence Plot

Partial Dependence Plots (PDP) are one of the most popular methods for exploration of the relation between a continuous variable and the model outcome.
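The PDP construction is simple: fix the variable of interest at a grid value for every observation and average the predictions. A minimal base-R sketch, using lm() and the built-in mtcars data as stand-ins:

```r
# A base-R sketch of a partial-dependence profile for one variable.
model <- lm(mpg ~ wt + hp, data = mtcars)

partial_dependence <- function(variable, grid) {
  sapply(grid, function(value) {
    data_modified <- mtcars
    data_modified[[variable]] <- value   # fix the variable at a grid value
    mean(predict(model, data_modified))  # average over all observations
  })
}

grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)
pdp_values <- partial_dependence("wt", grid)
plot(grid, pdp_values, type = "l", xlab = "wt", ylab = "average prediction")
```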

Use the function model_profile() with the parameter type = "partial" to calculate the PDP response.

pdp_regr_rf  <- model_profile(explainer_regr_rf, variable =  "construction.year", type = "partial")
pdp_regr_gbm  <- model_profile(explainer_regr_gbm, variable =  "construction.year", type = "partial")
pdp_regr_nn  <- model_profile(explainer_regr_nn, variable =  "construction.year", type = "partial")

plot(pdp_regr_rf, pdp_regr_gbm, pdp_regr_nn)

We use PDP plots to compare our three models. As we can see above, the behavior of the random forest and gbm models is very similar: both capture a non-linear relation that is not captured by the neural network.

2.4.2 Accumulated Local Effects plot

The Accumulated Local Effects (ALE) plot is an extension of the PDP that is better suited for highly correlated variables.
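The ALE construction can also be sketched in base R: split the variable's range into intervals, average the change in prediction when moving each observation in an interval from its left edge to its right edge, then accumulate these local effects. A simplified illustration with lm() and mtcars, not DALEX's implementation:

```r
# A base-R sketch of accumulated local effects for one variable.
model <- lm(mpg ~ wt + hp, data = mtcars)

ale_sketch <- function(variable, k = 5) {
  x <- mtcars[[variable]]
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = k + 1)))
  bin <- cut(x, breaks, include.lowest = TRUE)
  local_effect <- sapply(seq_along(levels(bin)), function(i) {
    idx <- which(as.integer(bin) == i)
    if (length(idx) == 0) return(0)              # skip empty intervals
    lo <- hi <- mtcars[idx, , drop = FALSE]
    lo[[variable]] <- breaks[i]                  # left edge of the interval
    hi[[variable]] <- breaks[i + 1]              # right edge of the interval
    mean(predict(model, hi) - predict(model, lo))
  })
  cumsum(local_effect)  # accumulate local effects across intervals
}

ale_sketch("wt")
```

Because only observations inside each interval are shifted, and only by the width of that interval, ALE avoids evaluating the model at unrealistic combinations of correlated variables, which is what can distort a PDP.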

Use the function model_profile() with the parameter type = "accumulated" to calculate the ALE curve for the variable construction.year.

ale_regr_rf  <- model_profile(explainer_regr_rf, variable =  "construction.year", type = "accumulated")
ale_regr_gbm  <- model_profile(explainer_regr_gbm, variable =  "construction.year", type = "accumulated")
ale_regr_nn  <- model_profile(explainer_regr_nn, variable =  "construction.year", type = "accumulated")

plot(ale_regr_rf, ale_regr_gbm, ale_regr_nn)

2.4.3 Partial Dependence Profile for categorical variable

The function model_profile() with the parameter type = "partial" also works for categorical variables, such as district.

mpp_regr_rf  <- model_profile(explainer_regr_rf, variable =  "district", type = "partial")
mpp_regr_gbm  <- model_profile(explainer_regr_gbm, variable =  "district", type = "partial")
mpp_regr_nn  <- model_profile(explainer_regr_nn, variable =  "district", type = "partial")

plot(mpp_regr_rf, mpp_regr_gbm, mpp_regr_nn)

We can note three groups: the city center (Srodmiescie), districts with good transport connections to the city center (Ochota, Mokotow, Zoliborz), and other districts, closer to the city boundaries.

3 Classification use case - wine data

To illustrate applications of DALEX to classification problems, we will use the wine dataset available in the breakDown package. We want to classify the quality of wine. Originally, this variable has seven levels, but in our example it is reduced to a binary classification. Our classification will be based on eleven features from this dataset.

White wine quality data is related to variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/.

library(breakDown)
data(wine)

wine$quality <- factor(ifelse(wine$quality > 5, 1, 0))

First, we use the caret function createDataPartition() to create train and test datasets.

trainIndex <- createDataPartition(wine$quality, p = 0.6, list = FALSE, times = 1)
wineTrain <- wine[ trainIndex,]
wineTest  <- wine[-trainIndex,]

Next, we use train() to fit three classification models: a random forest, a logistic regression, and a support vector machine.

classif_rf <- train(quality~., data = wineTrain, method = "rf", ntree = 100, tuneLength = 1)

classif_glm <- train(quality~., data = wineTrain, method = "glm", family = "binomial")

classif_svm <- train(quality~., data = wineTrain, method = "svmRadial", prob.model = TRUE, tuneLength = 1)

As before, we use the explain() function to create explainers for these models. The validation dataset for the models is wineTest.

In this case, we consider the differences between the observed class and the predicted probabilities to be the residuals. Therefore, we have to provide a custom predict function that takes two arguments, model and newdata, and returns a numeric vector of probabilities.

p_fun <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, 2]
}
yTest <- as.numeric(as.character(wineTest$quality))

explainer_classif_rf <- DALEX::explain(classif_rf, label = "rf",
                                       data = wineTest, y = yTest,
                                       predict_function = p_fun, 
                                       verbose = FALSE)

explainer_classif_glm <- DALEX::explain(classif_glm, label = "glm", 
                                        data = wineTest, y = yTest,
                                        predict_function = p_fun,
                                        verbose = FALSE)

explainer_classif_svm <- DALEX::explain(classif_svm,  label = "svm", 
                                        data = wineTest, y = yTest,
                                        predict_function = p_fun,
                                        verbose = FALSE)

3.1 Model performance

The function model_performance() calculates predictions and residuals for the validation dataset wineTest.

We use the generic plot() function to get a comparison of models.

mp_classif_rf <- model_performance(explainer_classif_rf)
mp_classif_glm <- model_performance(explainer_classif_glm)
mp_classif_svm <- model_performance(explainer_classif_svm)

plot(mp_classif_rf, mp_classif_glm, mp_classif_svm)

By setting geom = "boxplot" we can compare the distribution of residuals for the selected models.

plot(mp_classif_rf, mp_classif_glm, mp_classif_svm, geom = "boxplot")

3.2 Variable importance

The function model_parts() computes variable importance, which may then be plotted.

vi_classif_rf <- model_parts(explainer_classif_rf, loss_function = loss_root_mean_square)
vi_classif_glm <- model_parts(explainer_classif_glm, loss_function = loss_root_mean_square)
vi_classif_svm <- model_parts(explainer_classif_svm, loss_function = loss_root_mean_square)

plot(vi_classif_rf, vi_classif_glm, vi_classif_svm)

The left edges of the intervals start at the loss of the full model. The length of an interval corresponds to the variable's importance: a longer interval means a larger loss after permutation, so the variable is more important.

3.2.1 Partial Dependence Plot

As before, we create explainers designed to better understand the relation between a variable and the model output: PDP and ALE plots.

pdp_classif_rf  <- model_profile(explainer_classif_rf, variable = "pH", type = "partial")
pdp_classif_glm  <- model_profile(explainer_classif_glm, variable = "pH", type = "partial")
pdp_classif_svm  <- model_profile(explainer_classif_svm, variable = "pH", type = "partial")

plot(pdp_classif_rf, pdp_classif_glm, pdp_classif_svm)

3.2.2 Accumulated Local Effects plot

ale_classif_rf  <- model_profile(explainer_classif_rf, variable = "alcohol", type = "accumulated")
ale_classif_glm  <- model_profile(explainer_classif_glm, variable = "alcohol", type = "accumulated")
ale_classif_svm  <- model_profile(explainer_classif_svm, variable = "alcohol", type = "accumulated")

plot(ale_classif_rf, ale_classif_glm, ale_classif_svm)

4 Session Info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
## [3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.1250    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] breakDown_0.2.0 caret_6.0-86    ggplot2_3.3.0   lattice_0.20-38
## [5] DALEX_2.0.1    
## 
## loaded via a namespace (and not attached):
##  [1] gbm_2.1.5            tidyselect_1.1.0     xfun_0.12           
##  [4] kernlab_0.9-29       purrr_0.3.3          reshape2_1.4.3      
##  [7] splines_3.6.3        colorspace_1.4-1     vctrs_0.3.1         
## [10] generics_0.0.2       stats4_3.6.3         htmltools_0.4.0     
## [13] yaml_2.2.1           survival_3.1-8       prodlim_2019.11.13  
## [16] rlang_0.4.6          e1071_1.7-3          ModelMetrics_1.2.2.2
## [19] pillar_1.4.3         glue_1.3.2           withr_2.1.2         
## [22] foreach_1.4.8        lifecycle_0.2.0      plyr_1.8.5          
## [25] lava_1.6.7           stringr_1.4.0        timeDate_3043.102   
## [28] munsell_0.5.0        gtable_0.3.0         recipes_0.1.10      
## [31] codetools_0.2-16     evaluate_0.14        labeling_0.3        
## [34] knitr_1.28           class_7.3-15         Rcpp_1.0.4          
## [37] scales_1.1.0         ipred_0.9-9          ingredients_2.0     
## [40] farver_2.0.3         gridExtra_2.3        digest_0.6.25       
## [43] stringi_1.4.6        dplyr_1.0.0          grid_3.6.3          
## [46] tools_3.6.3          magrittr_1.5         tibble_2.1.3        
## [49] randomForest_4.6-14  crayon_1.3.4         pkgconfig_2.0.3     
## [52] MASS_7.3-51.5        Matrix_1.2-18        data.table_1.12.8   
## [55] pROC_1.16.1          lubridate_1.7.4      gower_0.2.1         
## [58] rmarkdown_2.1        iterators_1.0.12     R6_2.4.1            
## [61] rpart_4.1-15         nnet_7.3-12          nlme_3.1-144        
## [64] compiler_3.6.3