1.

Multi-word term extraction from comparable corpora by combining contextual and constituent clues

In this paper we present an approach to automatically extract and align multi-word terms from an English-Slovene comparable health corpus. First, the terms are extracted from the corpus for each language separately using a list of user-adjustable morphosyntactic patterns and a term weighting measure. Then, the extracted terms are aligned in a bag-of-equivalents fashion with a seed bilingual lexicon. In an extension of the approach we also show that the small general seed lexicon can be enriched with domain-specific vocabulary by harvesting it directly from the comparable corpus, which significantly improves the results of multi-word term mapping. While most previous efforts in bilingual lexicon extraction from comparable corpora have focused on the mapping of single words, the proposed technique extends them in that it is able to deal with multi-word terms as well. Since the proposed approach requires minimal knowledge resources, it is easily adaptable to a new language pair or domain, which is one of its biggest advantages.

B.03 Paper at an international scientific conference

COBISS.SI-ID: 49683298

2.

Learning to mine definitions from Slovene structured and unstructured knowledge-rich resources

The paper presents an innovative approach to extracting Slovene definition candidates from domain-specific corpora using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. A classification model was first trained on examples from the Slovene Wikipedia and then used to find well-formed definitions among the extracted candidates.

B.03 Paper at an international scientific conference

COBISS.SI-ID: 43122530

3.

Verbal multiword expressions in Slovene

This paper discusses the building of a manually annotated training corpus of Slovene verbal multiword expressions (MWEs), which was part of the PARSEME shared task that covered eighteen languages from various language families. In the course of the project, annotation guidelines were compiled, describing the annotation scope in detail and proposing a multilingual system for verbal MWE categorisation. In this paper, we present the methods of identification, the annotation scope and the linguistic tests that determine the structural, syntactic and lexical characteristics of the candidate verbal MWE lexical units.

B.03 Paper at an international scientific conference

COBISS.SI-ID: 65967458

4.

sloWTool

sloWTool is an all-in-one wordnet tool that enables browsing, editing and visualization of wordnet content with hyperbolic graphs and images. It is freely available under the CC-BY-SA licence and based on MySQL and PHP technologies, which makes the tool light-weight and portable. It is browser-independent and allows quick queries. Scripts for automatic database transformations from and into several standardized formats, such as DEBVisDic XML and LMF, are provided so that a wordnet for another language can be imported at any time. The on-line browser is simple to use for non-experts but also offers advanced search and view settings for expert users, who can enter complex search queries, decide which fields to display and toggle between a monolingual and a multilingual option.

F.15 Development of a new information system/databases

COBISS.SI-ID: 25364007

5.

Corpora of the Slovenian language then and now

The lecture at Cornell University showcased the development of corpora of Slovene as well as the development of corpus linguistics in Slovenia. We emphasized the significance of interdisciplinary research within this framework and provided the example of inter-faculty cooperation at the University of Ljubljana as a potential model of cooperation. Modern digital linguistics is increasingly redrawing the boundaries between the individual sub-fields, which has brought about an entirely new way of doing research in the humanities. Consequently, linguistic descriptions are no longer purely the domain of linguists, but of all participants in the process: from the preparation of language resources and their analysis to their presentation in the digital environment.

B.05 Guest lecturer at an institute/university

COBISS.SI-ID: 62246242

P6-0215 — Final report
