The project aims to develop machine learning methods and tools for Variable Importance Measurement (VIM) and Variable Selection in statistical prediction problems. The topic is addressed from both a methodological and an empirical point of view, and dedicated computational functions are implemented for the newly proposed procedures. On the methodological side, the main focus is on ensemble learning techniques.
Scientific coordinators: Paola Zuccolotto, Marco Sandri
Sandri M., Zuccolotto P. (2008), A bias correction algorithm for the Gini variable importance measure in classification trees, Journal of Computational and Graphical Statistics, 17(3), 611-628.
This article considers a measure of variable importance frequently used in variable-selection methods based on decision trees and tree-based ensemble models, such as CART, random forests, and gradient boosting machines. The measure is defined as the total heterogeneity reduction produced by a given covariate on the response variable when the sample space is recursively partitioned. Despite its popularity, some authors have shown that this measure is biased and that, under certain conditions, the bias can seriously distort variable selection. Here we present a simple and effective method for bias correction, focusing on the easily generalizable case of the Gini index as a measure of heterogeneity.
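The core idea behind the correction can be conveyed with a short sketch in R. The snippet below only illustrates the pseudo-variable strategy described in the paper, not the authors' code: row-permuted copies of the covariates are independent of the response by construction, so any Gini importance they receive estimates the bias component to be subtracted. The function name, arguments and defaults are illustrative.

```r
# Minimal sketch of a pseudo-variable bias correction for the Gini importance
# (illustrative code, not the authors' implementation).
library(randomForest)

corrected_gini <- function(X, y, R = 20, ntree = 500) {
  # X: data.frame of covariates (syntactic column names assumed)
  # y: factor response (classification); R: number of replications
  p  <- ncol(X)
  vi <- matrix(0, nrow = R, ncol = p, dimnames = list(NULL, colnames(X)))
  for (r in seq_len(R)) {
    Z <- X[sample(nrow(X)), , drop = FALSE]        # permuted copy: same structure, unrelated to y
    colnames(Z) <- paste0("Z_", colnames(X))
    fit <- randomForest(x = cbind(X, Z), y = y, ntree = ntree)
    imp <- importance(fit, type = 2)               # Mean Decrease in Gini
    vi[r, ] <- imp[colnames(X), 1] - imp[colnames(Z), 1]
  }
  colMeans(vi)                                     # bias-corrected importance estimates
}
```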
Sandri M., Zuccolotto P. (2010), Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms, Statistics and Computing, 20, 393-407.
Variable selection is one of the main problems faced by data mining and machine learning techniques. These techniques are often, more or less explicitly, based on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined in the field of decision trees and tree-based ensemble methods, such as Random Forests and Gradient Boosting Machines. In spite of their wide use, some measures of this class are known to be biased, and several correction strategies have been proposed. The aim of this paper is twofold: first, to investigate the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits; second, to extend a bias-correction algorithm, recently proposed for the Gini measure in classification, to the entire class of TDNI measures and to investigate its performance in the regression framework using simulated and real data.
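The role of uninformative splits in generating the bias can be illustrated with a small simulation, sketched below under invented settings (data and parameters are not from the paper): when the response is pure noise, every split is uninformative, yet covariates offering more candidate split points still accumulate a larger Total Decrease in Node Impurity.

```r
# Illustrative simulation of the bias of TDNI measures (not from the paper):
# with a pure-noise response, covariates with many possible split points
# still receive inflated impurity-based importance.
library(randomForest)

set.seed(123)
n <- 1000
dat <- data.frame(
  y  = rnorm(n),                                          # response independent of all covariates
  x1 = factor(sample(letters[1:2],  n, replace = TRUE)),  # 2 categories: one possible split
  x2 = factor(sample(letters[1:20], n, replace = TRUE)),  # 20 categories: many possible splits
  x3 = rnorm(n)                                           # continuous: many possible splits
)
rf <- randomForest(y ~ ., data = dat, ntree = 500)
importance(rf, type = 2)   # IncNodePurity: typically much larger for x2 and x3 than for x1
```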
Carpita M., Sandri M., Simonetto A., Zuccolotto P. (2014), Football Mining with R, in Data Mining Applications with R (Zhao Y., Cen Y., eds.), chapter 14, Elsevier.
This chapter presents a data mining process for investigating the relationship between the outcome of a football match (win, lose or draw) and a set of variables describing the actions of each team, using the R environment and selected R packages for statistical computing. The analyses were implemented with parallel computing when possible. Our goals were to identify, from hundreds of covariates, those that most strongly affect the probability of winning a match and to construct a small number of composite indicators based on the most predictive variables. These two tasks were carried out using the Random Forest machine learning algorithm and Principal Component Analysis, respectively. Variable selection was performed using the novel approach developed by Sandri and Zuccolotto (2008). Finally, we compared the results of several different classification models and algorithms (Random Forest, Classification Neural Network, K-Nearest Neighbor, Naïve Bayes classifier, and Multinomial Logit regression), assessing both their performance and the insightfulness of their results.
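A schematic of the two-step pipeline (importance-based screening followed by dimension reduction) is sketched below. The data frame `matches`, its `outcome` column and the cut-off of 20 variables are placeholders, and raw Gini importance is used here for brevity, whereas the chapter applies the bias-corrected version.

```r
# Schematic of the screening + summarisation pipeline (placeholder data names).
library(randomForest)

rf  <- randomForest(outcome ~ ., data = matches, ntree = 1000)   # outcome: win/draw/lose factor
imp <- sort(importance(rf, type = 2)[, 1], decreasing = TRUE)    # Mean Decrease in Gini
top_vars <- names(imp)[1:20]                                     # retain the most important covariates

pca <- prcomp(matches[, top_vars], scale. = TRUE)                # composite indicators from leading PCs
summary(pca)                                                     # assumes the retained covariates are numeric
```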
Carpita M., Sandri M., Simonetto A., Zuccolotto P. (2015), Discovering the Drivers of Football Match Outcomes with Data Mining, Quality Technology & Quantitative Management, 12(4), 537-553.
In this paper the relationship between the outcome of a football match (win, lose or draw) and a set of variables describing the game actions is investigated across time, by analyzing data from four consecutive yearly championships. The aim of the study is to discover the factors that lead to winning a match. More precisely, the goal is to select, from hundreds of covariates, those that most strongly affect the probability of winning a match, to recognize regularities across time by identifying the variables whose importance is confirmed in different analyses, and finally to construct a small number of composite indicators to be interpreted as drivers of the match outcome. These tasks are carried out using the Random Forest machine learning algorithm, to select the most important variables, and Principal Component Analysis, to summarize them into a small number of drivers. Variable selection is performed using the novel approach developed by Sandri and Zuccolotto (2008, 2010).
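One simple way to express the cross-season regularity step in code is sketched below; the list `season_data` (one data frame per championship), the `outcome` column and the top-20 cut-off are placeholders, and this is not necessarily the exact procedure used in the paper.

```r
# Flagging covariates whose importance is confirmed in every season
# (placeholder object names; illustrative cut-off of 20 variables).
library(randomForest)

top_by_season <- lapply(season_data, function(d) {
  rf <- randomForest(outcome ~ ., data = d, ntree = 1000)
  names(sort(importance(rf, type = 2)[, 1], decreasing = TRUE))[1:20]
})
stable_vars <- Reduce(intersect, top_by_season)   # variables in the top set of all championships
stable_vars
```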
Nembrini S., König I. R., Wright M. N. (2018), The revival of the Gini importance?, Bioinformatics, 34(21), 3711-3718.
Random forests are fast and flexible and represent a robust approach to the analysis of high-dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency.
We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it yields a variable importance measure that is unbiased with regard to the number of categories and minor allele frequency, and almost as fast to compute as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead beyond the construction of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient.
The procedure is included in the R package ranger, available on CRAN.
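For reference, a minimal usage sketch with ranger is shown below; `dat` and its response column `y` are placeholders for the user's own data, and the number of trees is arbitrary.

```r
# Corrected impurity importance and associated p-values in ranger
# (placeholder data frame `dat` with response column `y`).
library(ranger)

rf <- ranger(y ~ ., data = dat, num.trees = 1000,
             importance = "impurity_corrected")   # debiased impurity importance
rf$variable.importance                            # corrected importance scores

importance_pvalues(rf, method = "janitza")        # p-values at negligible extra cost
```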