The paper presents the development, extension and cleaning of the Slovene wordnet (sloWNet) by reusing existing language resources. The initial induction of synsets and the subsequent extension of sloWNet are based on multilingual resources and were performed automatically. The cleaning of the developed lexicon, on the other hand, is based on a monolingual reference corpus and requires manual validation. Manual work is performed in sloWTool, a new browser, editor and visualizer of wordnet content. The developed wordnet and editor are freely available under the Creative Commons licence.
B.03 Paper at an international scientific conference
COBISS.SI-ID: 47786850
Definition extraction is an emerging field of NLP research. This paper presents an innovative information extraction workflow designed to extract definition candidates from domain-specific corpora, using morphosyntactic patterns, automatic terminology recognition and semantic tagging with wordnet senses. The workflow, implemented in ClowdFlows, a novel service-oriented workflow environment, was applied to the task of definition extraction from two corpora of academic papers in the domain of Computational Linguistics, one in Slovene and the other in English. The definition extraction workflow is available online, so it can be reused for definition extraction from other corpora, and it is easily adaptable to other languages provided that the required language-specific workflow components are accessible as public services on the web.
B.03 Paper at an international scientific conference
COBISS.SI-ID: 26151975
The paper presents an approach to automatically extracting and aligning multi-word terms from an English-Slovene comparable health corpus. First, the terms are extracted from the corpus for each language separately using a list of user-adjustable morphosyntactic patterns and a term weighting measure. Then, the extracted terms are aligned in a bag-of-equivalents fashion with a seed bilingual lexicon. In an extension of the approach, we also show that the small general seed lexicon can be enriched with domain-specific vocabulary by harvesting it directly from the comparable corpus, which significantly improves the results of multi-word term mapping. While most previous efforts in bilingual lexicon extraction from comparable corpora have focused on mapping single words, the proposed technique extends them by handling multi-word terms as well. Since the proposed approach requires minimal knowledge resources, it is easily adaptable to a new language pair or domain, which is one of its biggest advantages.
B.03 Paper at an international scientific conference
COBISS.SI-ID: 49683298
The paper reports on a series of experiments aimed at improving the machine translation of ambiguous lexical items by using wordnet-based unsupervised Word Sense Disambiguation (WSD) and comparing its results with the output of three MT systems. Our experiments are performed for the English-Slovene language pair using UKB, a freely available graph-based word sense disambiguation system. Results are evaluated in three ways: a manual evaluation of WSD performance from an MT perspective, an analysis of agreement between the WSD-proposed equivalent and those suggested by the three systems, and a computation of BLEU, NIST and METEOR scores for all translation versions. Our results show that WSD performs with an MT-relevant precision of 71% and that 21% of sense-related MT errors could be prevented by using unsupervised WSD.
B.03 Paper at an international scientific conference
COBISS.SI-ID: 49734242
Lecture at the international EMUNI Translation Studies Doctoral Summer School (http://www.prevajalstvo.net/emuni-doctoral-summer-school), a joint project of six universities: University of Ljubljana (Slovenia), Boğaziçi University (Turkey), Turku University and University of Eastern Finland (Finland), University of Granada (Spain) and EMUNI University (Slovenia). The lecture was given in the methodology section and addressed the limitations of quantitative methods in translation studies research.
B.05 Guest lecturer at an institute/university
COBISS.SI-ID: 49469282