When building classifiers, it is natural to require that the classifier correctly estimates the event probability (Constraint 1), that it has equal sensitivity and specificity (Constraint 2), or that it has equal positive and negative predictive values (Constraint 3). We prove that in the balanced case, where events and non-events occur in equal proportion, any classifier that satisfies one of these constraints will always satisfy all three. Such unbiasedness with respect to events and non-events is much more difficult to achieve in the case of rare events, i.e. when the proportion of events is (much) smaller than 0.5. Here, we prove that it is impossible to meet all three constraints unless the classifier achieves perfect predictions. Any non-perfect classifier can satisfy at most one constraint, and satisfying one constraint implies violating the other two in a specific direction. Our results have implications for classifiers optimized using g-means or the F-measure, which tend to satisfy Constraints 2 and 1, respectively. Our results are derived from basic probability theory and illustrated with simulations based on several frequently used classifiers.
COBISS.SI-ID: 33010393
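The interplay between the three constraints can be checked numerically. The sketch below is not from the paper; it only applies the standard confusion-matrix definitions to two made-up examples, a balanced one and a rare-event one, each with equal sensitivity and specificity:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    return {
        "prevalence": (tp + fn) / n,     # true event proportion
        "pred_rate": (tp + fp) / n,      # estimated event proportion (Constraint 1)
        "sensitivity": tp / (tp + fn),   # Constraint 2: sensitivity == specificity
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),           # Constraint 3: PPV == NPV
        "npv": tn / (tn + fn),
    }

# Balanced case: 500 events, 500 non-events, sensitivity = specificity = 0.8.
# All three constraints hold at once: pred_rate == prevalence == 0.5 and
# ppv == npv == 0.8.
balanced = confusion_metrics(tp=400, fp=100, fn=100, tn=400)

# Rare events: 100 events, 900 non-events, still sensitivity = specificity = 0.8.
# Constraint 2 holds, but the event probability is overestimated
# (pred_rate = 0.26 vs prevalence = 0.10) and PPV (~0.31) != NPV (~0.97).
rare = confusion_metrics(tp=80, fp=180, fn=20, tn=720)
```

This matches the claimed direction of violation: a rare-event classifier tuned for Constraint 2 overestimates the event probability and has PPV far below NPV.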
Science is a societal process, built on widely accepted general rules that facilitate its development. Productive researchers are viewed from the perspective of a social network of their interpersonal relations. In this paper we address the performance of the Slovenian research community using bibliographic networks between the years 1970 and 2015, from various aspects that determine prolific science. We focus on basic determinants of research performance, including productivity, collaboration, internationality, and interdisciplinarity. For each determinant, we select a set of statistics and network measures to investigate its state in every year of the analyzed period. The analysis is based on high-quality data from manually curated information systems. We interpret the results by relating them to important historical events affecting Slovenia and to domestic expenditure for research and development. Our results clearly demonstrate causal relations between the performance of the research community and changes in wider society. Political and financial stability, together with consistent measurement of scientific productivity established soon after Slovenia won independence from Yugoslavia in 1991, had a positive influence on all determinants. These were further leveraged by the foundation of the Slovenian research agency and by joining the EU and NATO. The publish-or-perish phenomenon, the negative impacts of the 2008–2014 financial crisis, and the reshaping of domestic expenditure for research and development after 2008 also have a clear response in the scientific community. In the paper, we also study researchers' career productivity cycles and present an analysis of career productivity for all registered researchers in Slovenia.
COBISS.SI-ID: 2048412691
Papers evaluating measures of explained variation, or similar indices, almost invariably use independence from censoring as the most important criterion. They invariably conclude that some measures meet this criterion and some do not, usually leading to the conclusion that the former are better than the latter. As a consequence, users are offered measures that cannot be used with time-dependent covariates and effects, not to mention extensions to repeated events or multi-state models. We explain in this paper that the aforementioned criterion is of no use in studying such measures, because it simply favors those that make an implicit assumption of the model being valid everywhere. Measures not making such an assumption are disqualified, even though they are better in every other respect. We show that if these allegedly inferior measures are allowed to make the same assumption, they are easily corrected to satisfy the independence-from-censoring criterion. Even better, it is enough to make such an assumption only for times greater than the last observed failure time T, which, in contrast with the 'preferred' measures, makes it possible to use all the modeling flexibility up to T and assume whatever one wants after T. As a consequence, we claim that some of the measures preferred as better in the existing reviews are in fact inferior.
COBISS.SI-ID: 32214489
When analyzing time to disease recurrence, we sometimes need to work with data where all the recurrences are recorded, but no information is available on possible deaths. This may occur when studying diseases of a benign nature, where patients are only seen at disease recurrences, or in poorly designed registries of benign diseases or medical device implantations without sufficient patient identifiers to obtain the patients' dead/alive status at a later date. When the average time to disease recurrence is long enough in comparison with the expected survival of the patients, statistical analysis of such data can be significantly biased. Under the assumption that the expected survival of an individual is not influenced by the disease itself, general population mortality tables may be used to remove this bias. We show why the intuitive solution of simply imputing each patient's expected survival time does not give unbiased estimates of the usual quantities of interest in survival analysis, and further explain that cumulative incidence function analysis does not require additional assumptions on general population mortality. We provide an alternative framework that allows unbiased estimation and introduce two new approaches: an iterative imputation method and a mortality-adjusted at-risk function. Their properties are carefully studied, with the results supported by simulations and illustrated on a real-world example.
COBISS.SI-ID: 32255193
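The bias from unrecorded deaths is easy to reproduce in a toy simulation. The sketch below is not the paper's method and all hazard values are invented for illustration; it contrasts a naive exponential hazard estimate, in which a patient who silently died is treated as recurrence-free and at risk until the end of follow-up, with the estimate one would obtain if death times were known and used for censoring:

```python
import random

random.seed(1)

LAM_REC, LAM_DEATH, TAU = 0.02, 0.05, 60.0  # hypothetical monthly hazards, 60-month follow-up
N = 100_000

naive_time = known_time = events = 0.0
for _ in range(N):
    t_rec = random.expovariate(LAM_REC)      # time to disease recurrence
    t_death = random.expovariate(LAM_DEATH)  # time to death (unobserved in the data at hand)
    if t_rec < min(t_death, TAU):            # recurrence happens first, so it is recorded
        events += 1
        naive_time += t_rec
        known_time += t_rec
    else:
        naive_time += TAU                    # naive: patient assumed at risk until TAU
        known_time += min(t_death, TAU)      # oracle: censor at the (unknown) death time

naive_rate = events / naive_time             # biased well below the true 0.02
oracle_rate = events / known_time            # approximately recovers the true 0.02
```

Removing this bias without observing individual death times is exactly where the paper brings in general population mortality tables.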
Objectives: Literature-based discovery (LBD) is a text-mining methodology for automatically generating research hypotheses from existing knowledge. We mimic the process of LBD as a classification problem on a graph of MeSH terms, employing unsupervised and supervised link prediction methods to predict previously unknown connections between biomedical concepts. Methods: We evaluate the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We perform link prediction using proximity measures such as common neighbors (CN), the Jaccard coefficient (JC), the Adamic/Adar index (AA), and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future. Results: In the unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In the supervised approach, we evaluate whether the proximity measures can be combined into a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. The random forest classifier achieved the best performance (AUC = 0.87). Conclusions: The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform the unsupervised approach to link prediction.
COBISS.SI-ID: 32835801
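The four proximity measures have standard definitions. A minimal sketch on a toy undirected graph (the node labels and edges are invented, standing in for MeSH terms; this is not the paper's network):

```python
import math

# Toy co-occurrence graph as adjacency sets (undirected).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
         ("C", "D"), ("C", "E"), ("D", "E")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def common_neighbors(u, v):    # CN: number of shared neighbors
    return len(adj[u] & adj[v])

def jaccard(u, v):             # JC: shared neighbors / all neighbors
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def adamic_adar(u, v):         # AA: shared neighbors, down-weighting high-degree hubs
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

def pref_attachment(u, v):     # PA: product of the two degrees
    return len(adj[u]) * len(adj[v])

# Score the currently non-adjacent pair (A, D) as a candidate future link.
scores = {
    "CN": common_neighbors("A", "D"),
    "JC": jaccard("A", "D"),
    "AA": adamic_adar("A", "D"),
    "PA": pref_attachment("A", "D"),
}
```

In the supervised setting described above, these four scores, computed for every candidate node pair, become the feature vector fed to the classifiers.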