1.

Improved shrunken centroid classifiers for high-dimensional class-imbalanced data

Background: PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results: We show that when data are class-imbalanced, the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers, we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy, which is based on minimizing the overall error rate, when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data.
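The difference between the two selection criteria described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy labels and the two candidate shrinkage thresholds (2.0 and 0.5) are hypothetical, and the cross-validated predictions are hard-coded rather than produced by an NSC classifier.

```python
from math import prod


def class_accuracies(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (t == p)
    return {c: correct[c] / totals[c] for c in totals}


def g_means(y_true, y_pred):
    """Geometric mean of the class-specific predictive accuracies."""
    accs = list(class_accuracies(y_true, y_pred).values())
    return prod(accs) ** (1.0 / len(accs))


def overall_error(y_true, y_pred):
    """Fraction of misclassified samples, ignoring class membership."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 8 majority ("A") and 2 minority ("B") samples.
y_true = ["A"] * 8 + ["B"] * 2

# Hypothetical CV predictions for two candidate shrinkage thresholds.
cv_predictions = {
    2.0: ["A"] * 10,                         # everything assigned to the majority
    0.5: ["A"] * 5 + ["B"] * 3 + ["B"] * 2,  # some majority errors, minority correct
}

# Criterion 1: minimize the overall CV error rate.
by_error = min(cv_predictions, key=lambda thr: overall_error(y_true, cv_predictions[thr]))

# Criterion 2: maximize the CV g-means.
by_gmeans = max(cv_predictions, key=lambda thr: g_means(y_true, cv_predictions[thr]))

print("selected by overall error:", by_error)
print("selected by g-means:", by_gmeans)
```

In this toy case the overall error rate favors the threshold that assigns every sample to the majority class, whereas g-means is zero for that threshold because the minority-class accuracy is zero, so the other threshold is selected.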

COBISS.SI-ID: 30458841

2.

Using literature-based discovery to identify novel therapeutic approaches

We present a promising in silico paradigm called literature-based discovery (LBD) and describe its potential to identify novel pharmacologic approaches to treating diseases. The goal of LBD is to generate novel hypotheses by analyzing the vast biomedical literature. Additional knowledge resources, such as ontologies and specialized databases, are often used to supplement the published literature. MEDLINE, the largest and most important biomedical bibliographic database, is the most common source for exploiting LBD. There are two variants of LBD, open discovery and closed discovery. With open discovery we can, for example, try to find a novel therapeutic approach for a given disease, or find new therapeutic applications for an existing drug. With closed discovery we can find an explanation for a relationship between two concepts. For example, if we already have a hypothesis that a particular drug is useful for a particular disease, with closed discovery we can identify the mechanisms through which the drug could have a therapeutic effect on the disease. We briefly describe the methodology behind LBD and then discuss in more detail currently available LBD tools; we also mention in passing some of those no longer available. Next we present several examples in which LBD has been exploited for identifying novel therapeutic approaches. In conclusion, LBD is a powerful paradigm with considerable potential to complement more traditional drug discovery methods, especially for drug target discovery and for existing drug relabeling.

COBISS.SI-ID: 677804

3.

SMOTE for high-dimensional class-imbalanced data

Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. Results: While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data. Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
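For readers unfamiliar with SMOTE, the interpolation step it relies on can be sketched as follows. This is a simplified illustration, not the code used in the study: the function name `smote`, its parameters and the toy minority samples are assumptions, and the seed sample is drawn at random rather than by cycling through all minority samples as in the original formulation of SMOTE.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by linear interpolation.

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by distance to x and keep the k nearest.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: euclidean(x, m))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_new=20, k=2, rng=random.Random(1))
print(len(synthetic), "synthetic samples generated")
```

Because every synthetic sample is a convex combination of two existing minority samples, it can never fall outside the range spanned by the original minority data, which is in line with the reduced variability and the correlation between samples reported in the abstract.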

COBISS.SI-ID: 30528217

4.

Integration of data from omic studies with the literature-based discovery towards identification of novel treatments for neovascularization in diabetic retinopathy

Diabetic retinopathy (DR) is a secondary complication of diabetes associated with retinal neovascularization and represents the leading cause of blindness in the adult population in the developed world. Despite research efforts, the nature of the pathogenetic processes leading to DR is still unknown, making the development of novel effective treatments difficult. Advances in omic technologies now offer unprecedented insight into global molecular alterations in DR, but identifying novel treatments based on the massive amounts of data generated in omic studies still represents a considerable challenge. For this reason, we attempted to facilitate the discovery of novel treatments for DR by complementing the interpretation of omic results with the vast body of information in the published literature, using literature-based discovery (LBD) approaches. To achieve this, we collected data from transcriptomic studies performed on retinal tissue from animal models of DR, performed a meta-analysis of these datasets and identified altered genes and pathways. Using the SemBT LBD framework, we determined which therapies could regulate the perturbed pathways or stabilize the gene expression alterations in DR. We show that with this approach we could not only reidentify drugs currently in use or in clinical trials, but also indicate novel treatment directions for ameliorating neovascularization processes in DR.

COBISS.SI-ID: 31259609

5.

Large-scale structure of a network of co-occurring MeSH terms

Concept associations can be represented by a network that consists of a set of nodes representing concepts and a set of edges representing their relationships. Complex networks exhibit some common topological features, including small diameter, high degree of clustering, power-law degree distribution, and modularity. We investigated the topological properties of a network constructed from co-occurrences between MeSH descriptors in the MEDLINE database. We conducted the analysis on two networks, one constructed from all MeSH descriptors and another using only major descriptors. Network reduction was performed using Pearson's chi-square test for independence. To characterize the topological properties of the network we adopted some specific measures, including diameter, average path length, clustering coefficient, and degree distribution. For the full MeSH network the average path length was 1.95 edges, with a diameter of three edges and a clustering coefficient of 0.26. The Kolmogorov-Smirnov test rejected the power law as a plausible model for the degree distribution. For the major MeSH network the average path length was 2.63 edges, with a diameter of seven edges and a clustering coefficient of 0.15. The Kolmogorov-Smirnov test failed to reject the power law as a plausible model. The power-law exponent was 5.07. In both networks it was evident that nodes with a lower degree exhibit higher clustering than those with a higher degree. After a simulated attack, in which we removed the 10% of nodes with the highest degrees, the giant component of each of the two networks still contained about 90% of all nodes. Because of its small average path length and high degree of clustering, the MeSH network is small-world. A power-law distribution is not a plausible model for the degree distribution. The network is highly modular, highly resistant to targeted and random attack, and shows minimal disassortativity.
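The topological measures named in the abstract can be illustrated on a toy graph. The sketch below is an assumption-laden example, not the analysis pipeline of the study: the four-node graph and the function names are hypothetical, and the clustering coefficient is computed as the average of local clustering coefficients (nodes with fewer than two neighbours contribute zero), which may differ from the definition used by the authors.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def path_stats(graph):
    """Return (average path length, diameter) over all reachable node pairs."""
    lengths = []
    for source in graph:
        dist = bfs_distances(graph, source)
        lengths.extend(d for target, d in dist.items() if target != source)
    return sum(lengths) / len(lengths), max(lengths)


def average_clustering(graph):
    """Average local clustering coefficient of an undirected graph."""
    total = 0.0
    for node, nbs in graph.items():
        k = len(nbs)
        if k < 2:
            continue  # contributes zero by convention
        links = sum(1 for a, b in combinations(nbs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


# Hypothetical toy graph: a triangle A-B-C with a pendant node D attached to C.
toy_graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

avg_path, diameter = path_stats(toy_graph)
clustering = average_clustering(toy_graph)
print(avg_path, diameter, clustering)
```

A short average path length combined with a high clustering coefficient is the pattern that the abstract uses to characterize the MeSH network as small-world.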

COBISS.SI-ID: 2048311059

J3-4246 — Final report
