Analysis of heterogeneous information networks for knowledge discovery in life-sciences
Code | Science | Field | Subfield
7.00.00 | Interdisciplinary research | |
Code | Science | Field
P176 | Natural sciences and mathematics | Artificial intelligence
Code | Science | Field
1.02 | Natural Sciences | Computer and information sciences
Data mining, knowledge discovery, semantic data mining, workflows, heterogeneous networks, plant immune signalling
Researchers (16)
Organisations (2)
Abstract
The proposal addresses knowledge discovery in complex data mining scenarios in the life sciences. With the development of high-throughput molecular biology techniques, the data generated are reaching the scale of so-called Big Data. Information relevant to a given biological question is scattered across different public resources, in heterogeneous formats and in a form inaccessible to typical biologists. To overcome this situation, we need to fuse this information into a single data source that can be mined. The aim of the proposed project is to develop, implement, evaluate and apply a new methodology for analyzing large heterogeneous data in the area of life sciences. The development of the proposed methodology is motivated by a tremendous increase in data generation within life-sciences research, while the means for explanatory knowledge discovery from these large heterogeneous data sources are still lagging behind. We aim to improve the existing data analysis approaches by extending and combining text mining, relational data mining and information fusion methods. To evaluate the proposed methodology we will use several benchmark and real-world problems in the area of life sciences, aiming to advance translational research in agriculture by extracting novel knowledge on plant immune signaling.
The project has the following objectives:
1. Development of a new methodology, which will enable fusing texts and complex relational background knowledge into the form of a large heterogeneous information network. This will be achieved by extending our own methodology for mining heterogeneous information networks through contextualizing the information on data instances in terms of available semantic background knowledge (domain taxonomies and ontologies), and by adapting the methodology to big data and complex life-science scenarios.
2. Implementation of the methodology in the ClowdFlows or TextFlows platform and experimental evaluation of the proposed methodology on publicly available benchmark data sets, including selected medical problems for which large public heterogeneous data sets exist.
3. Application of the methodology to three life-science application scenarios: (i) cross-domain knowledge discovery from documents from two unrelated life-science problems, aiming to uncover as yet unknown relations between "redox status" and "plant immune signaling", (ii) mining a time-stamped stream of heterogeneous experimental data in the domain of plant immune signaling, and (iii) identification of key components in plant immune signaling that determine the outcome of a disease.
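The idea behind the first objective, fusing documents with semantic background knowledge into one heterogeneous information network, can be illustrated with a minimal Python sketch. This is not the project's methodology or the ClowdFlows/TextFlows API: the class `HeteroNetwork`, the functions `annotate` and `project`, the node names and the toy taxonomy are all hypothetical and chosen only for illustration. The sketch assumes that documents are annotated with ontology terms, that each annotation is propagated to the term's ancestors in a taxonomy, and that the network is then projected onto documents by counting shared terms.

```python
from collections import defaultdict
from itertools import combinations


class HeteroNetwork:
    """Toy heterogeneous information network with typed nodes and typed edges."""

    def __init__(self):
        self.node_type = {}            # node name -> node type
        self.edges = defaultdict(set)  # node name -> {(edge type, neighbour)}

    def add_node(self, name, ntype):
        self.node_type[name] = ntype

    def add_edge(self, a, b, etype):
        # Edges are stored in both directions (undirected network).
        self.edges[a].add((etype, b))
        self.edges[b].add((etype, a))

    def neighbours(self, node, ntype):
        return {n for _, n in self.edges[node] if self.node_type[n] == ntype}


def annotate(net, doc, term, parents):
    """Link a document to a term and to every ancestor of that term.

    `parents` is a hypothetical child -> parent taxonomy mapping.
    """
    while term is not None:
        net.add_node(term, "term")
        net.add_edge(doc, term, "annotated_with")
        term = parents.get(term)


def project(net, base_type, via_type):
    """Homogeneous projection: weight = number of shared `via_type` nodes."""
    weights = defaultdict(int)
    for mid, ntype in net.node_type.items():
        if ntype != via_type:
            continue
        linked = sorted(net.neighbours(mid, base_type))
        for a, b in combinations(linked, 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical taxonomy: both specific terms share the ancestor "signalling".
parents = {"redox_signalling": "signalling", "immune_signalling": "signalling"}

net = HeteroNetwork()
for doc in ("doc_a", "doc_b", "doc_c"):
    net.add_node(doc, "document")

annotate(net, "doc_a", "redox_signalling", parents)
annotate(net, "doc_b", "redox_signalling", parents)
annotate(net, "doc_c", "immune_signalling", parents)

weights = project(net, "document", "term")
print(weights)
```

In this toy example `doc_a` and `doc_c` share no directly assigned term and become linked only through the common ancestor `signalling`, which is the kind of cross-domain connection that taxonomy-based contextualization is meant to expose.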
The project will contribute to the development of new algorithms for mining large heterogeneous data. Accessibility of the developed methodology will be ensured by implementing it in one of our web-based data mining platforms, ClowdFlows or TextFlows, which will make the developed technology available to a broader research audience and increase its relevance for life-science experts. The research will be performed in close collaboration between data mining experts from JSI and domain experts from NIB.
Significance for science
This project addresses the open problem of assisting scientists with the increasingly daunting task of heterogeneous and distributed information fusion and knowledge discovery. Solving this problem requires the development of a new computational paradigm that integrates ideas from different supporting domains. An adequate solution to this problem will result in new technologies that are relevant to a range of applications, some of which are also mentioned in the EU FP7 ICT work programme, such as Challenge 4 on Content and Challenge 5 on Healthcare. It covers issues such as knowledge management and creation, but goes beyond them in assisting users (particularly scientists) in knowledge discovery across distributed information repositories.
The project will advance the state of the art by developing a framework for mining heterogeneous information networks, new data mining algorithms and a new approach to interactively formulating and refining powerful knowledge discovery workflows. The proposed project thus addresses an open problem and pursues a long-term objective with high technological potential.
Successful results of the MinHIN project can help Europe’s knowledge industry become more effective, efficient and competitive. The challenges addressed by the MinHIN project cannot be adequately met with existing ICT methodologies or their incremental improvements: the methods developed within MinHIN will differ substantially from existing information fusion and knowledge discovery technologies, and will require the collaboration of scientists with diverse backgrounds to tackle challenges in innovative information fusion, data mining, distributed information retrieval, and sophisticated user interfaces. A successful outcome of the project may have, first, a significant impact on data mining technology and on science and, in the longer term, when adapted to knowledge discovery, a considerable impact on the capacity of Europe’s private and public sectors for public data analysis.
The proposed MinHIN project has the potential to implement and demonstrate a paradigm shift in information and knowledge management, discovery, fusion and understanding. The MinHIN prototype will establish a strong scientific and technological basis for a broader, interdisciplinary research community, and will help cultivate the underlying methodologies to a level at which they can attract investment from industry, especially in the pharmaceutical and biotechnology sectors.
Significance for the country
Since the project aims at the analysis of heterogeneous information networks of potato, the project results will directly influence the food industry. Potato is currently the third most important food crop worldwide. It produces high amounts of non-allergenic vegetable protein per hectare and contains many vitamins and health-promoting compounds, and thus has increasing significance as a food crop in the developing world. Yet its production is currently not optimal, owing to the high input costs during cultivation needed to achieve an appropriate yield and to its susceptibility to biotic and abiotic factors. The EU potato industry is very competitive and is continuously gaining market share worldwide. Hundreds of cultivars are used, many with close cultural and regional ties. In Slovenia in the 1980s, the PVY epidemic completely eliminated the sensitive, but at that time leading, potato cultivars, which virtually terminated Slovenian seed potato production. Currently there are only a few completely resistant cultivars, but growing them is problematic because of the specific Slovenian climate as well as from the perspective of genetic diversity. The research findings of the proposed project will provide a basis for precision breeding of environmentally resilient cultivars.
Most important scientific results
Interim report, final report
Most important socioeconomically and culturally relevant results
Interim report, final report