In clinical research the goal is often to correctly estimate the probability of an event. For this purpose, several patient characteristics are measured and used to develop a prediction model that can predict class membership for future patients. Ensemble classifiers combine many different classifiers and can be useful because combining a set of classifiers can yield more accurate predictions. Gradient boosting is an ensemble classifier that has been shown to perform well when the number of variables exceeds the number of samples (high-dimensional data); however, it has not been evaluated for the prediction of rare events. It is demonstrated that gradient boosting suffers from severe rare-events bias, correctly classifying only a small proportion of samples from the rare class. The bias can be removed by using subsampling in combination with an appropriate amount of shrinkage, but only for a specific number of boosting iterations and for the binomial loss function. It is shown that the number of boosting iterations at which the rare-events bias is removed cannot be estimated efficiently from the training data when the sample size is small. Therefore, several corrections for the rare-events bias of gradient boosting are proposed and evaluated using simulated and real high-dimensional data. It is demonstrated that the proposed corrections successfully remove the rare-events bias and outperform the other ensemble classifiers that...
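The rare-events bias can be illustrated with a minimal pure-Python simulation (a sketch of the general phenomenon under an assumed logistic model, not a reproduction of the boosting experiments above; the `simulate` helper and all parameter values are hypothetical): even a well-calibrated classifier, thresholded at the usual 0.5 cut-off, assigns almost no samples to the rare class.

```python
import math
import random

random.seed(1)

def simulate(n, beta0, beta1):
    """One covariate; the event probability follows a logistic model."""
    data = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))  # true probability
        y = 1 if random.random() < p else 0               # observed outcome
        data.append((p, y))
    return data

# beta0 = -3 gives a low event rate (roughly 5-7%): the rare-events setting.
data = simulate(20000, -3.0, 1.0)
events = [(p, y) for p, y in data if y == 1]
event_rate = len(events) / len(data)

# Sensitivity at the 0.5 cut-off: the fraction of true events whose
# (perfectly calibrated) probability even reaches 0.5 is tiny.
sens = sum(1 for p, y in events if p >= 0.5) / len(events)
print(f"event rate ~ {event_rate:.3f}, sensitivity at 0.5 cut-off ~ {sens:.3f}")
```

The same mechanism is at work when a boosting model is trained on rare events: the fitted probabilities of the rare class rarely exceed 0.5, so the default classification rule misses most events.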
COBISS.SI-ID: 32788953
Firth's logistic regression has become a standard approach for the analysis of binary outcomes with small samples. While it reduces the bias in maximum likelihood estimates of coefficients, it introduces a bias towards one-half in the predicted probabilities. The stronger the imbalance of the outcome, the more severe the bias in the predicted probabilities. We propose two simple modifications of Firth's logistic regression that result in unbiased predicted probabilities. The first corrects the predicted probabilities by a post hoc adjustment of the intercept. The second is based on an alternative formulation of Firth's penalization as an iterative data-augmentation procedure; our suggested modification introduces an indicator variable that distinguishes between original and pseudo-observations in the augmented data. In a comprehensive simulation study, these approaches are compared with other attempts to improve predictions based on Firth's penalization and with other published penalization strategies intended for routine use. For instance, we consider a recently suggested compromise between maximum likelihood and Firth's logistic regression. Simulation results are scrutinized with regard to prediction and effect estimation. We find that both suggested methods not only give unbiased predicted probabilities but also improve the accuracy conditional on explanatory variables compared with Firth's penalization. While one method results in effect estimates...
COBISS.SI-ID: 33134041
When building classifiers, it is natural to require that the classifier correctly estimates the event probability (Constraint 1), that it has equal sensitivity and specificity (Constraint 2), or that it has equal positive and negative predictive values (Constraint 3). We prove that in the balanced case, where there are equal proportions of events and non-events, any classifier that satisfies one of these constraints will always satisfy all three. Such unbiasedness with respect to events and non-events is much more difficult to achieve in the case of rare events, i.e. when the proportion of events is (much) smaller than 0.5. Here, we prove that it is impossible to meet all three constraints unless the classifier achieves perfect predictions. Any non-perfect classifier can satisfy at most one constraint, and satisfying one constraint implies violating the other two in a specific direction. Our results have implications for classifiers optimized using g-means or the F-measure, which tend to satisfy Constraints 2 and 1, respectively. Our results are derived from basic probability theory and illustrated with simulations based on some frequently used classifiers.
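The incompatibility of the constraints under rare events can be checked numerically (a simulation under an assumed logistic model, not the paper's proofs; all parameter values are our own): a classifier that is calibrated (Constraint 1) and uses the 0.5 cut-off ends up with sensitivity far below specificity and PPV below NPV in this example.

```python
import math
import random

random.seed(7)

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Calibrated classifier on rare events: classify with the true probability
# and the usual 0.5 cut-off, then tabulate the confusion matrix.
tp = fp = tn = fn = 0
for _ in range(200000):
    x = random.gauss(0.0, 1.0)
    p = expit(-2.5 + 1.5 * x)       # true (calibrated) event probability
    y = 1 if random.random() < p else 0
    pred = 1 if p >= 0.5 else 0
    if pred and y:
        tp += 1
    elif pred and not y:
        fp += 1
    elif not pred and y:
        fn += 1
    else:
        tn += 1

sens = tp / (tp + fn)
spec = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
print(round(sens, 3), round(spec, 3), round(ppv, 3), round(npv, 3))
```

In this run Constraint 1 holds by construction, while Constraints 2 and 3 are both violated, consistent with the impossibility result for non-perfect classifiers.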
COBISS.SI-ID: 33010393
Near-infrared spectroscopy has become a widely used analytical technique in different research fields due to its non-destructiveness and low cost. The spectra are rich in information but extremely complex, so their analysis requires advanced statistical methods. The empirical properties of these methods can be assessed using artificially generated data that resemble real near-infrared spectra. In this paper we propose a new data-generation approach (ABS) that takes into account the theoretical knowledge about the near-infrared absorption of the functional groups. The proposed method is compared with real data and with a simpler data-generation method, which simulates the data from a multivariate normal distribution whose parameters are estimated from real data (MVNorig). The comparison between real data and the data-generation approaches is based on a class-imbalanced classification problem using linear discriminant analysis, classification trees and support vector machines. Both simulation approaches generated spectra with a good resemblance to real data, with MVNorig performing slightly better than ABS; using real and simulated data we would have reached similar conclusions about the class-imbalance problem in classification. Both methods can be used to artificially generate near-infrared spectra. The method based on the multivariate normal distribution can be used when a large number of real spectra are available, while the appropriateness of...
COBISS.SI-ID: 33505753