Machine Learning for Survival Data

The project studies machine learning algorithms for survival analysis, with a main focus on survival trees and random survival forests. Some aspects of the topic remain unclear, chiefly performance assessment. Moreover, the proposed methods have not yet been harmonized. The analysis is carried out from both a theoretical and a practical point of view, with the aim of shedding light on the topic.

Scientific coordinators: Ambra Macis, Marica Manisera, Marco Sandri, Paola Zuccolotto.

Seminal papers

Survival trees: a pathway among features and open issues of the main R packages

Macis A., Survival trees: a pathway among features and open issues of the main R packages, EJASA (2022).

Supplementary File 1 – Tutorial in R

Supplementary File 2 – Structure of Trees

Supplementary File 3 – Script

Survival analysis studies the occurrence of a particular event during a follow-up period. Recently, many machine learning methods have been used to analyze right-censored data. Among these, survival trees are a useful recursive partitioning tool for defining groups that are homogeneous in terms of survival probability. However, how to work with these methods in practice is still unclear in several respects. Although many methods have been proposed, many of them are poorly documented, particularly with regard to the corresponding R functions. Moreover, these proposals have not been harmonized. This work aims to shed light on the topic and to provide a practical guide to simulating survival data, fitting survival trees and evaluating their performance with the statistical software R.
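As a rough illustration of what simulating right-censored data and estimating survival probability involve, the sketch below draws event and censoring times and computes a Kaplan-Meier curve. It is written in plain Python rather than R and is not taken from the tutorial or from any R package. The function names (simulate_right_censored, kaplan_meier), the exponential event and censoring distributions and the rate values are all assumptions chosen for illustration.

```python
import random
from itertools import groupby


def simulate_right_censored(n, event_rate=0.1, censor_rate=0.05, seed=0):
    """Return n pairs (observed_time, event), with event = 1 if the event was seen.

    Assumption: event and censoring times are independent exponentials.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        event_time = rng.expovariate(event_rate)
        censor_time = rng.expovariate(censor_rate)
        # Only the earlier of the two times is observed.
        observed = min(event_time, censor_time)
        data.append((observed, int(event_time <= censor_time)))
    return data


def kaplan_meier(data):
    """Return [(time, survival_probability)] at each distinct event time."""
    ordered = sorted(data)
    at_risk = len(ordered)
    survival = 1.0
    curve = []
    for t, group in groupby(ordered, key=lambda pair: pair[0]):
        group = list(group)
        deaths = sum(event for _, event in group)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Subjects with an event or censored at t leave the risk set.
        at_risk -= len(group)
    return curve


# Tiny hand-checkable example: the subject observed at time 2 is censored.
toy = [(1, 1), (2, 0), (3, 1), (4, 1)]
print(kaplan_meier(toy))  # [(1, 0.75), (3, 0.375), (4, 0.0)]

sample = simulate_right_censored(200)
print(sum(event for _, event in sample), "events out of", len(sample))
```

In the toy example the censored subject contributes to the risk set up to time 2 but never triggers a drop in the curve, which is the defining feature of right-censoring.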

PhD Thesis

Statistical Models and Machine Learning for Survival Data Analysis

Author: Ambra Macis

Supervisor: Paola Zuccolotto

Second Supervisor: Marco Sandri

Co-Supervisor: Marica Manisera

The main topic of this thesis is survival analysis, a collection of methods used in longitudinal studies in which the interest lies not only in whether a particular event occurs, but also in the time until it is observed. Over the years, first statistical models and then machine learning methods have been proposed for survival analysis. The first part of the work provides an introduction to the basic concepts of survival analysis and an extensive review of the existing literature. In particular, it focuses on the main statistical models (nonparametric, semiparametric and parametric) and, among machine learning methods, on survival trees and random survival forests. For these methods, the main proposals introduced over the last decades are described. The second part of the thesis reports my research contributions. These works had two main aims: (1) the rationalization into a unified protocol of the computational approach, which currently relies on several existing packages with little documentation, several still obscure points and some bugs, and (2) the application of survival data analysis methods in an unusual context where, to the best of our knowledge, this approach had never been used. The first contribution was a tutorial intended to help interested users approach these methods, bringing order to the many existing algorithms and packages and providing solutions to the related computational issues. It covers the main steps of a simulation study: (i) survival data simulation, (ii) model fitting and (iii) performance assessment. The second contribution applied survival analysis methods, both statistical models and machine learning algorithms, to the offensive performance of National Basketball Association (NBA) players. In particular, variable selection was performed to determine the main variables associated with the probability of exceeding a given number of points scored during the post-All-Star Game segment of the season and with the time needed to do so. In conclusion, this thesis aims to lay the groundwork for a unified framework that harmonizes the existing fragmented approaches and is free of computational issues. Moreover, its findings suggest that a survival analysis approach can also be extended to new contexts.
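Performance assessment, step (iii) of the simulation protocol described above, is often based on a concordance measure. The sketch below computes Harrell's concordance index in plain Python. It is an illustration only, not the evaluation procedure used in the thesis; the function name concordance_index and the toy data are hypothetical. Pairs with tied observed times are simply ignored, which is a simplifying assumption.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event and
    a strictly shorter observed time than subject j. The pair is
    concordant when i also has the higher predicted risk; ties in risk
    count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Toy data: the third subject is censored; higher risk means earlier event.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [4, 3, 2, 1]))  # 1.0
print(concordance_index(times, events, [1, 1, 1, 1]))  # 0.5
```

A value of 1.0 means every comparable pair is ranked correctly, while 0.5 corresponds to a model whose risk scores carry no ranking information.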
