1.

Combinatorial algorithm for counting small induced graphs and orbits

We developed a new algorithm for counting graphlet frequencies and orbits in large sparse graphs. The algorithm is applicable to many areas, in particular bioinformatics. Its time complexity is lower than that of existing algorithms by an order of magnitude; in practical terms, execution times are a hundredfold shorter on the graphs we typically encounter in bioinformatics. We had previously published an algorithm for a limited setup; in this paper we generalize it to graphlets of arbitrary size.

COBISS.SI-ID: 1537349315
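
To illustrate what is being counted, the following is a minimal brute-force sketch for 3-node graphlets only. It is not the combinatorial algorithm from the paper: it enumerates every node triple and is therefore far slower. The function name `count_three_node_orbits` and the orbit labels are our own, chosen for illustration.

```python
from itertools import combinations


def count_three_node_orbits(edges):
    """Count, for every node, how often it appears in each orbit of the
    two connected 3-node graphlets (induced path and triangle).

    Brute-force baseline for illustration only; labels are hypothetical.
    """
    # Build an undirected adjacency structure, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    orbits = {v: {"path_end": 0, "path_middle": 0, "triangle": 0} for v in adj}

    for trio in combinations(sorted(adj), 3):
        # Degree of each node inside the induced subgraph on the trio.
        deg = {v: sum(1 for w in trio if w in adj[v]) for v in trio}
        n_edges = sum(deg.values()) // 2
        if n_edges == 3:
            for v in trio:
                orbits[v]["triangle"] += 1
        elif n_edges == 2:
            for v in trio:
                role = "path_middle" if deg[v] == 2 else "path_end"
                orbits[v][role] += 1
        # Triples with fewer than two edges are disconnected and skipped.
    return orbits


if __name__ == "__main__":
    # A triangle 0-1-2 with a pendant node 3 attached to node 2.
    for node, counts in count_three_node_orbits(
        [(0, 1), (1, 2), (2, 0), (2, 3)]
    ).items():
        print(node, counts)
```

The sketch takes time cubic in the number of nodes, which is the kind of cost that becomes prohibitive on large sparse graphs.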

2.

Jumping across biomedical contexts using compressive data fusion

The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. We present Medusa, an algorithm that builds on collective matrix factorization to derive different semantics. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.

COBISS.SI-ID: 1537026243

3.

Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models

The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. In this paper, we propose methods that quantify prediction confidence by estimating the prediction error at the point of interest. Our experimental results indicate that these alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set.

COBISS.SI-ID: 10466388
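
As a rough illustration of estimating the prediction error at the point of interest, the sketch below averages the absolute errors a model made on the training compounds nearest to a query. This is one simple possibility under our own assumptions, not necessarily any of the methods assessed in the paper. The name `local_error_estimate`, the Euclidean distance on descriptor vectors, and the choice of `k` are hypothetical.

```python
import math


def local_error_estimate(train_X, train_errors, query, k=3):
    """Estimate the expected absolute prediction error at `query` as the
    mean absolute error observed on its k nearest training points.

    train_X      -- sequence of descriptor vectors (equal-length tuples)
    train_errors -- signed errors of the model on those points, for
                    example cross-validation residuals (an assumption)
    """
    if not train_X or len(train_X) != len(train_errors):
        raise ValueError("train_X and train_errors must be non-empty and aligned")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Rank training points by Euclidean distance to the query.
    ranked = sorted(
        zip(train_X, train_errors),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(abs(err) for _, err in nearest) / len(nearest)


if __name__ == "__main__":
    X = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
    residuals = [0.1, -0.3, 2.0, -4.0]
    print(local_error_estimate(X, residuals, (0.5, 0.0), k=2))
    print(local_error_estimate(X, residuals, (10.5, 10.0), k=2))
```

A larger estimate suggests the query lies in a region where the model has been less accurate, so its prediction there deserves less confidence.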

4.

Gene network inference by probabilistic scoring of relationships from a factorized model of interactions

Epistasis analysis is an essential tool of classical genetics. We propose a conceptually new probabilistic approach to gene network inference from quantitative interaction data. The approach is founded on joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches.

COBISS.SI-ID: 10624852

5.

Small network completion using frequent subnetworks

Prediction of missing or potential links is currently the central theme in network analysis. We define the problem of small network completion, which deals with sets of small networks, possibly with no recorded temporal dynamics. This problem requires a different set of methods and evaluation procedures. We present a method named Hyspan that extracts frequent patterns from small networks and uses them to predict missing vertices and edges in new networks.

COBISS.SI-ID: 1536144835

J2-5480 — Final report
