Projects / Programmes source: ARIS

Language-independent methods for automatic construction of semantic lexicons from comparable corpora

Research activity

Code Science Field Subfield
6.05.00  Humanities  Linguistics   

Code Science Field
H350  Humanities  Linguistics 

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords
semantic lexicons, wordnet, corpora, language resources for Slovene, lexical semantics, automatic methods
Evaluation (rules)
source: COBISS
Researchers (1)
no. Code Name and surname Research area Role Period No. of publications
1.  26294  PhD Darja Fišer  Linguistics  Head  2010 - 2012 
Organisations (1)
no. Code Research organisation City Registration number No. of publications
1.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  15 
Abstract
As the amount and importance of electronic documents increase, handling them efficiently without computer support is becoming unfeasible. That is why numerous computer applications have been developed that classify documents into groups according to their content, retrieve information from large text collections, generate abstracts of long texts, translate documents from one language to another, etc. However, such technological solutions require a certain degree of language understanding. This can be achieved with collections in which human knowledge is organized in a way that enables computers to access the meaning of words and phrases and understand the relations between them. One of the most popular concept-based lexicons that organizes concepts into a network with lexical and semantic relations is wordnet (Fellbaum 1998). It was first developed for English, after which wordnets for more than 50 other languages were constructed. Among them is a wordnet for Slovene, which I developed during my PhD. With the proposed post-doctoral research I wish to continue work on the Slovene semantic lexicon; the aim is to develop a methodology for the construction of wordnets from comparable corpora. Comparable corpora are becoming an increasingly popular resource in computational, corpus and contrastive linguistics, as well as translation studies. While parallel corpora are a preferable resource, only a few are available, and they are typically very limited in size, language pairs and domains (McEnery and Xiao 2006). In the post-doctoral project I propose to focus on Wikipedia as a comparable corpus. 
Under the premise that articles which describe the same concept use very similar words to do so, it is possible to take advantage of the standardized article structure, keywordness, cognates and other statistical measures in order to estimate translation equivalence between words in different languages and extract a multilingual lexicon from the comparable corpus (Sharoff 2008). Such an approach can handle monosemous as well as polysemous vocabulary and is also suitable for multi-word terms and named entities, which change too rapidly to be suitably represented by traditional dictionaries and glossaries. The proposed post-doctoral research project consists of several phases. In the first phase, I will transform Wikipedia into a multilingual comparable corpus. The second phase of the project is the extraction of the multilingual lexicon from the corpus from the Slovene part of Wikipedia. The extracted lexicon entries will then be assigned a wordnet id and Slovene synsets will be generated. In the third phase of the project, the method and the constructed resource will be evaluated in an application for automatic word-sense disambiguation. The results of the evaluation will give insight into the suitability of the constructed Slovene semantic lexicon for practical tasks. To my knowledge, there has been no research into comparable corpora in the field of Slovene lexical semantics. This is why the proposed project presents an important milestone in Slovene corpus linguistics as well as human language technologies. Not only will the result of the project be an established, tested and language-independent methodology for the extraction of translation equivalents from comparable corpora; the project will also bring a tangible result in the form of a Slovene semantic lexicon that is aligned to wordnets in many other languages and therefore useful for mono- as well as multilingual applications. 
In this way, the developed wordnet will bridge the gap in the field of Slovene language resources and provide the foundations for a broader, semantically-enriched exploitation of Slovene corpus resources.
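The cognate cue mentioned in the abstract can be illustrated with a minimal sketch. It is not the project's actual method: it assumes that orthographic similarity is measured as a normalized Levenshtein distance, and the function names, the threshold of 0.7 and the toy word lists are all hypothetical choices made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cognate_score(a: str, b: str) -> float:
    """Orthographic similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))


def cognate_candidates(src_words, tgt_words, threshold=0.7):
    """Return (source, target, score) triples at or above the (assumed) threshold."""
    pairs = []
    for s in src_words:
        for t in tgt_words:
            score = cognate_score(s, t)
            if score >= threshold:
                pairs.append((s, t, score))
    return sorted(pairs, key=lambda p: -p[2])


# Toy example: only the Slovene/English cognate pair passes the threshold.
candidates = cognate_candidates(["lingvistika", "miza"], ["linguistics", "table"])
```

In practice such a score would be only one of several signals, combined with the article structure and statistical measures the abstract lists.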
Significance for science
Since there has been no research into comparable corpora in the field of Slovene lexical semantics, the completed project presents a major milestone in Slovene corpus linguistics as well as natural language processing. The importance of the results of the project is two-fold: (1) the project resulted in an established, tested and language-independent methodology for the extraction of translation equivalents from comparable corpora; and (2) the semantic lexicon for Slovene has been released. It is aligned to wordnets in many other languages and is therefore useful for mono- as well as multilingual applications. In this way, the developed wordnet narrows the gap in the field of Slovene language resources and provides the foundations for a broader, semantically-enriched exploitation of Slovene corpus resources. The key findings based on the completed research are:
- For successful extraction of translation equivalents, custom-made comparable corpora are not required. Instead, web corpora, which already exist for a number of languages or can be built relatively easily, suffice.
- For successful extraction of translation equivalents, the same amount of data for both languages, which is extremely difficult to achieve for most languages, is not required. Instead, given a careful selection of statistical similarity measures, substantially more data may be available for one language (e.g. English) than for the other (e.g. Slovene) with equal success.
- For successful extraction of translation equivalents from closely related languages, a seed dictionary, which is unavailable for many language pairs, is not required. Instead, it can be induced directly from the comparable corpus by taking into account lexical overlap and other similarities between the two languages.
- According to the main premise of distributional semantics, words with a similar meaning appear in similar contexts. This allows us to identify not only translation equivalents but also false friends, which are orthographically very similar in both languages but normally have very different frequencies in corpora and very different context vectors. Based on this principle we have also developed an efficient methodology for automatic identification of false friends, which is useful in human language technologies as well as in lexicography and language pedagogy.
The impact of the first semantic lexicon for Slovene, developed within the post-doctoral project, extends beyond the host institution, since all the created resources are freely available to all other researchers under a Creative Commons licence, which has so far not been common practice in Slovenia.
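The distributional idea behind these findings can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: context vectors are plain co-occurrence counts within a fixed window, the seed dictionary is a small hand-written mapping, and similarity is cosine. All names (`context_vector`, `translate_vector`, `cosine`) and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            vec.update(tokens[lo:i] + tokens[i + 1:hi])
    return vec


def translate_vector(vec, seed):
    """Map context words into the other language; drop words missing from the seed."""
    out = Counter()
    for word, count in vec.items():
        if word in seed:
            out[seed[word]] += count
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


# Toy comparable "corpora" and an assumed seed dictionary.
sl_tokens = "pes laja na dvorišču pes teče".split()
en_tokens = "the dog barks in the yard the dog runs".split()
seed = {"laja": "barks", "teče": "runs", "dvorišču": "yard"}

pes = translate_vector(context_vector(sl_tokens, "pes", window=1), seed)
sim_dog = cosine(pes, context_vector(en_tokens, "dog", window=1))
sim_yard = cosine(pes, context_vector(en_tokens, "yard", window=1))
```

With these toy data, "dog" scores higher than "yard" as a translation candidate for "pes". In the same picture, a false-friend candidate would be a pair that looks orthographically similar but receives a low cosine score.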
Significance for the country
Semantic lexicons such as wordnet have been used in a range of tasks and for a range of purposes by researchers in academia as well as in industry. Among the companies to have gained the most advantage from the use of wordnet is Google. They have used wordnet for refining web search and targeted advertising, an example that could well be followed by any Slovene company that offers web services. A direct way to use the developed wordnet is to integrate it into an application for automatic word-sense disambiguation and into semantically-aware web services, such as (cross-language) information retrieval, question answering and machine translation. Such applications would be highly welcome in a knowledge-based society with highly evolved information technology, because the Slovene language is still lagging behind its European counterparts in this field.
Most important scientific results Annual report 2010, 2011, final report, complete report on dLib.si
Most important socioeconomically and culturally relevant results Annual report 2010, 2011, final report, complete report on dLib.si