Projects / Programmes source: ARIS

Conquering the Curse of Dimensionality by Using Background Knowledge

Research activity

Code Science Field Subfield
2.07.07  Engineering sciences and technologies  Computer science and informatics  Intelligent systems - software 

Code Science Field
P176  Natural sciences and mathematics  Artificial intelligence 

Code Science Field
1.02  Natural Sciences  Computer and information sciences 
Keywords
data mining, statistics, machine learning, dimensionality reduction, background knowledge
Evaluation (rules)
source: COBISS
Researchers (19)
no.  Code  Name and surname  Research area  Role  Period  No. of publications
1.  36469  PhD Niko Colnerič  Computer science and informatics  Junior researcher  2015 - 2016 
2.  23399  PhD Tomaž Curk  Computer science and informatics  Researcher  2013 - 2016 
3.  16324  PhD Janez Demšar  Computer science and informatics  Head  2013 - 2016 
4.  31035  MSc Marjana Erdelji  Computer science and informatics  Researcher  2015 
5.  35424  PhD Tomaž Hočevar  Computer science and informatics  Junior researcher  2015 - 2016 
6.  38462  Jernej Kernc  Computer science and informatics  Technical associate  2015 
7.  25792  PhD Minca Mramor  Human reproduction  Researcher  2013 - 2016 
8.  32042  PhD Matija Polajnar  Computer science and informatics  Junior researcher  2013 - 2014 
9.  38461  PhD Ajda Pretnar Žagar  Computer science and informatics  Technical associate  2015 
10.  33189  Anže Starič  Computer science and informatics  Junior researcher  2013 - 2016 
11.  29630  PhD Miha Štajdohar  Computer science and informatics  Researcher  2013 
12.  38464  Vesna Tanko  Computer science and informatics  Researcher  2015 
13.  30142  PhD Marko Toplak  Computer science and informatics  Researcher  2013 - 2016 
14.  37693  MSc Maja Vodopivec  Computer science and informatics  Researcher  2014 - 2015 
15.  23987  PhD Martin Vuk  Mathematics  Researcher  2013 - 2014 
16.  12536  PhD Blaž Zupan  Computer science and informatics  Researcher  2013 - 2016 
17.  30921  PhD Lan Žagar  Computer science and informatics  Researcher  2013 - 2015 
18.  32929  Jure Žbontar  Computer science and informatics  Researcher  2013 - 2015 
19.  35422  PhD Marinka Žitnik  Computer science and informatics  Researcher  2015 
Organisations (2)
no.  Code  Research organisation  City  Registration number  No. of publications
1.  0312  University Medical Centre Ljubljana  Ljubljana  5057272000  125 
2.  1539  University of Ljubljana, Faculty of Computer and Information Science  Ljubljana  1627023 
Abstract
We live in a data-driven society whose functioning depends on gathering and analyzing huge quantities of data. Since collecting and storing data has become very cheap, we no longer observe small sets of well-chosen variables; we routinely collect large numbers of measurements for each data instance. This holds equally true for any field of human endeavor, from science, with, for instance, genome-wide sequencing and expression profiling, to business and economics, with, say, snapshots of share values or currency exchange rates recorded at short time intervals. In principle, this should enable us to find much more complex and unexpected patterns in the data than before. In practice, this abundance of data is like a huge haystack, and we lack efficient methods for finding the needles or, worse still, for distinguishing needles from straws. Formally, given the huge dimensionality of the data, current data mining methods find a great number of models and patterns that fit the data equally well. Although most of them are random, it is mathematically impossible to tell them apart from true phenomena. We argue that this problem is inherent in the current approach to data mining, which mostly uses only the data itself to construct new theories, a practice originally denounced as data fishing. So far the field has managed to work around the problem by biasing theories towards simplicity (e.g. using linear models, various regularizations, Occam's principle, etc.). In high-dimensional problems this approach fails because many simple theories fit the data equally well. We intend to research what we believe to be the only viable solution to the problem: just as classical science does not build theories from observations alone, the search for models, patterns and visualizations in data mining should build on existing knowledge about the domain.
For the purpose of the project, this prior knowledge can take any machine-readable form that describes the relations between variables: for instance, an ontology or a network of entities corresponding to the variables, correlations between variables observed in past experiments, rules explicitly given by an expert, or text documents related to the topic, which can be used to statistically relate the variables. Prior knowledge should be used in all phases of the data mining process. We propose to develop data-transformation methods that will, for instance, decrease the dimensionality of the problem by using prior knowledge to construct new, meaningful variables from the observed ones; note that this differs from traditional dimensionality reduction, which reduces the dimensionality of the data using the data itself. In visualization, we will develop methods for constructing useful visualizations based on the available background knowledge about the problem. Predictive modeling, especially in machine learning, involves a search through a huge space of models; again, this search can be guided to incorporate known relations between the variables. Finally, prior knowledge can be used to choose among the many discovered models and patterns that fit the data equally well. The project will borrow from recent advances in genetics, the field that has made the most progress in solving dimensionality problems by using prior knowledge, and from statistical techniques for dimensionality reduction and machine learning techniques for limiting the search space, none of which currently employ much background knowledge. For this reason, the core of the project group consists of a PI whose background is machine learning and two members with PhDs in statistics and in medicine, in particular genetics.
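The prior-knowledge-driven data transformation described above can be illustrated with a minimal sketch. The group names, data, and the simple averaging rule below are illustrative assumptions, not the project's actual method: each new variable is built as the mean of a group of observed variables that prior knowledge (e.g. an ontology) declares related.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 6))          # 10 instances, 6 observed variables

# Prior knowledge (hypothetical): each observed variable belongs to a group,
# e.g. genes annotated to the same pathway in an ontology.
groups = {"pathway_A": [0, 1, 2], "pathway_B": [3, 4], "pathway_C": [5]}

def aggregate_by_groups(X, groups):
    """Construct one new variable per group as the mean of its members."""
    return np.column_stack([X[:, idx].mean(axis=1) for idx in groups.values()])

Z = aggregate_by_groups(X, groups)
print(Z.shape)                        # dimensionality reduced from 6 to 3
```

Unlike PCA or similar projections, the resulting variables carry the meaning of their groups, so downstream models and visualizations remain interpretable.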
All methods will be implemented in open-source data mining tools, so they will be available for practical use on real-world problems and serve as a test bed for immediate evaluation and improvement of the algorithms developed within the project.
Significance for science
The basic premise of the project proposal - reflected in its title - was that the use of prior or background knowledge can improve the analysis of high-dimensional data. As a result of the work on this project, we no longer think in terms of distinguishing between "prior knowledge" and "data" but prefer to consider them different, heterogeneous data sources that can be fused together. One of the most important scientific achievements of the project team was the development of methods for data fusion that work on an (in principle) arbitrary number of data sources of any type that can be represented with matrices and connected into a graph. The technique is highly versatile and can be adapted to many different problems, as we demonstrated in numerous well-cited works. Second, in the era of big data, networks are becoming a prominent structure for representing data, since collected data often describe sets of objects that can be (pairwise) related in different ways. As such - and in particular in the data fusion setup described above - networks played an important role in the project. We developed new techniques that make network-analysis methods practical that were previously infeasible due to their time complexity. A particular achievement was a fast combinatorial algorithm for counting graphlet orbits in large sparse networks. Scientific progress requires tools. The group continued the development of one of the most popular open-source data mining platforms, Orange. In the past three years, it was extended with modules for working with extremely large (e.g. several terabytes) data, analysis of time series, spectral images, text mining, image embedding and many other methods related, or at least tangential, to the work done within this project.
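The idea of fusing matrix-shaped data sources connected in a graph can be sketched as follows. This is a simplified toy version under stated assumptions, not the project's published algorithm: two relation matrices that share one object type are factorized jointly, with a common latent factor for the shared type updated by plain gradient descent; all dimensions, data, and the learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3, k = 20, 15, 10, 4
R12 = rng.random((n1, n2))   # relations between object types 1 and 2
R13 = rng.random((n1, n3))   # relations between object types 1 and 3

G1 = rng.random((n1, k))     # latent factor shared by both sources
G2 = rng.random((n2, k))
G3 = rng.random((n3, k))

loss0 = np.linalg.norm(R12 - G1 @ G2.T) + np.linalg.norm(R13 - G1 @ G3.T)

lr = 0.01
for _ in range(500):
    E12 = R12 - G1 @ G2.T    # reconstruction residuals of each source
    E13 = R13 - G1 @ G3.T
    # The shared factor G1 receives gradient signal from both sources,
    # which is what couples (fuses) the two data matrices.
    G1 += lr * (E12 @ G2 + E13 @ G3)
    G2 += lr * (E12.T @ G1)
    G3 += lr * (E13.T @ G1)

loss = np.linalg.norm(R12 - G1 @ G2.T) + np.linalg.norm(R13 - G1 @ G3.T)
print(loss0, "->", loss)
```

The design point is the shared factor: because G1 must reconstruct both relation matrices at once, structure in one source constrains the representation learned from the other, and the scheme extends to any number of matrices connected into a graph of object types.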
Significance for the country
Besides the core research team, much of the work on the project was done by graduate and undergraduate students, who thus got the opportunity to experience state-of-the-art scientific work. Most of the team members come from the Bioinformatics Laboratory at the Faculty of Computer and Information Science, University of Ljubljana. The group has -- also thanks to this project -- become stronger, and we have obtained additional funding from other Slovenian and foreign agencies and companies. The Bioinformatics Laboratory is currently one of the biggest and most productive Slovenian research groups in this field. Through seminars, workshops and other presentations we promoted Slovenian scientific achievements abroad. Team members were also active promoters of our field among both adults and younger generations.
Most important scientific results: Annual report 2013, 2014, 2015, final report
Most important socioeconomically and culturally relevant results: Annual report 2014, 2015, final report