Background PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables is larger, when the class imbalance is more extreme, and/or when the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers, we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions Experiments on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy, based on the minimization of the overall error rate, when the NSC classifiers are biased towards the majority class. Moreover, the number of variables included in the NSC classifiers is much smaller with our approach than with the original one.
COBISS.SI-ID: 30458841
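To make the proposed tuning strategy concrete, the sketch below selects the shrinkage threshold of an NSC classifier by maximizing the cross-validated g-means rather than minimizing the overall error rate. This is a minimal sketch in Python, assuming scikit-learn's NearestCentroid (with its shrink_threshold parameter) as a stand-in for PAM; the helper gmeans_shrinkage and the threshold grid are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score

def gmeans_shrinkage(X, y, thresholds, n_splits=5, seed=0):
    """Choose the NSC shrinkage threshold that maximizes the CV geometric
    mean of the class-specific accuracies, instead of the threshold that
    minimizes the overall CV error rate."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_t, best_g = None, -1.0
    for t in thresholds:
        per_class = []
        for tr, te in cv.split(X, y):
            clf = NearestCentroid(shrink_threshold=t).fit(X[tr], y[tr])
            # class-specific accuracies = per-class recalls on the held-out fold
            per_class.append(recall_score(y[te], clf.predict(X[te]),
                                          average=None, zero_division=0))
        mean_acc = np.mean(per_class, axis=0)
        g = float(np.prod(mean_acc)) ** (1.0 / len(mean_acc))  # geometric mean
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```

The only change with respect to standard CV tuning is the scoring step: the geometric mean drops to zero as soon as one class is never predicted correctly, so a threshold that sacrifices the minority class cannot be selected, whereas such a threshold can easily minimize the overall error rate when the imbalance is large.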
We present a promising in silico paradigm called literature-based discovery (LBD) and describe its potential to identify novel pharmacologic approaches to treating diseases. The goal of LBD is to generate novel hypotheses by analyzing the vast biomedical literature. Additional knowledge resources, such as ontologies and specialized databases, are often used to supplement the published literature. MEDLINE, the largest and most important biomedical bibliographic database, is the most commonly used source for LBD. There are two variants of LBD: open discovery and closed discovery. With open discovery we can, for example, try to find a novel therapeutic approach for a given disease, or find new therapeutic applications for an existing drug. With closed discovery we can find an explanation for a relationship between two concepts. For example, if we already have a hypothesis that a particular drug is useful for a particular disease, with closed discovery we can identify the mechanisms through which the drug could have a therapeutic effect on the disease. We briefly describe the methodology behind LBD and then discuss in more detail the currently available LBD tools; we also mention in passing some that are no longer available. Next we present several examples in which LBD has been exploited for identifying novel therapeutic approaches. In conclusion, LBD is a powerful paradigm with considerable potential to complement more traditional drug discovery methods, especially for drug target discovery and for the relabeling of existing drugs.
COBISS.SI-ID: 677804
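The distinction between open and closed discovery can be illustrated with the ABC co-occurrence model that underlies many LBD systems: a starting concept A (e.g., a disease) is linked to bridging concepts B, which are in turn linked to candidate concepts C (e.g., drugs or mechanisms). The Python sketch below is a toy illustration of this model only, not of any of the tools discussed here; the function names and the representation of documents as sets of concept identifiers are assumptions.

```python
from collections import defaultdict

def build_links(documents):
    """Build a concept co-occurrence graph; each document is a set of
    concept IDs (e.g., MeSH terms extracted from MEDLINE records)."""
    links = defaultdict(set)
    for concepts in documents:
        for a in concepts:
            links[a] |= concepts - {a}
    return links

def open_discovery(links, a):
    """A -> B -> C: rank candidate concepts C that are not directly linked
    to A, by the number of distinct bridging concepts B."""
    b_terms = links[a]
    scores = defaultdict(int)
    for b in b_terms:
        for c in links[b] - b_terms - {a}:
            scores[c] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

def closed_discovery(links, a, c):
    """Explain a hypothesized A-C relationship by the B concepts shared by both."""
    return links[a] & links[c]

# Swanson's classic fish oil / Raynaud disease discovery, reduced to four documents:
docs = [{"fish oil", "blood viscosity"}, {"blood viscosity", "Raynaud disease"},
        {"fish oil", "platelet aggregation"}, {"platelet aggregation", "Raynaud disease"}]
links = build_links(docs)
print(open_discovery(links, "fish oil"))                       # [('Raynaud disease', 2)]
print(closed_discovery(links, "fish oil", "Raynaud disease"))  # the bridging mechanisms
```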
Background Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. The Synthetic Minority Oversampling Technique (SMOTE) is a very popular oversampling method that was proposed to improve upon random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. Results While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification into the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, without variable selection, the k-NN classification is instead biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data. Conclusions In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
COBISS.SI-ID: 30528217
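The reported properties (unchanged class-specific means, reduced variability, correlation between samples) follow directly from how SMOTE generates data: each synthetic sample is a convex combination of a minority-class sample and one of its k nearest minority-class neighbors. Below is a minimal Python sketch of that interpolation step; the function smote is an illustrative reimplementation written for this summary, not the reference one (in practice an established implementation such as imblearn.over_sampling.SMOTE would be used).

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic samples by interpolating each sampled
    minority point towards one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)                                   # requires n > k
    # Euclidean distances are computed within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbors per point
    base = rng.integers(0, n, size=n_new)            # randomly chosen seed points
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # one random neighbor each
    gap = rng.random((n_new, 1))                     # interpolation factor in (0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Because every synthetic point lies on a segment between two existing minority samples, its expected value equals the class mean, while its variance is smaller than that of the original samples and it is correlated with the points used to generate it, which is exactly the behavior described above.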