In this paper we present a language-independent, fully modular and automatic approach to bootstrapping a wordnet for a new language by recycling different types of existing language resources, such as machine-readable dictionaries, parallel corpora, and Wikipedia. The approach, which we apply here to Slovene, takes into account monosemous and polysemous words, general and specialised vocabulary, as well as simple and multi-word lexemes. The extracted words are then assigned one or more synset IDs by a classifier that relies on several features, including distributional similarity. Finally, we identify and remove highly dubious (literal, synset) pairs based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic, manual and task-based evaluations show that the resulting resource, the latest version of the Slovene wordnet, is already a valuable source of lexico-semantic information.
COBISS.SI-ID: 56782434
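The distributional-similarity feature mentioned above can be illustrated with a minimal sketch: comparing two words by the cosine similarity of their bag-of-context-words vectors. This is a generic illustration of the technique, not the paper's actual implementation; the function name and toy contexts are invented for the example.

```python
import math
from collections import Counter

def cosine(ctx_a, ctx_b):
    """Cosine similarity between two bags of context words.

    ctx_a, ctx_b: lists of context tokens observed around each word.
    Returns a value in [0, 1]; higher means more similar distributions.
    """
    a, b = Counter(ctx_a), Counter(ctx_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Identical contexts -> similarity 1.0; disjoint contexts -> 0.0
print(cosine(["bark", "tail", "bark"], ["bark", "tail", "bark"]))  # 1.0
print(cosine(["bark", "tail"], ["branch", "leaf"]))                # 0.0
```

In a wordnet-bootstrapping setting, such a score between a candidate literal and the existing members of a synset could serve as one feature among several for the classifier.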
Web texts represent an increasing segment of language production both worldwide and in Slovenia. User-generated content is thus becoming an increasingly important source of knowledge and affects future language development. In order to harness this potential, it is necessary to conduct a thorough analysis of internet language use, which differs from traditional language production. The first step in this direction is the construction and analysis of the Janes corpus of internet Slovene, which is presented in this paper.
COBISS.SI-ID: 59017570
Non-standard language as it appears in user-generated content has recently attracted much attention. This paper proposes that non-standardness comes in two basic varieties, technical and linguistic, and develops a machine-learning method to discriminate between standard and non-standard texts in these two dimensions. We describe the manual annotation of a dataset of Slovene user-generated content and the features used to build our regression models. We evaluate and discuss the results, where the mean absolute error of the best-performing method on a three-point scale is 0.38 for technical and 0.42 for linguistic standardness prediction. Even when using no language-dependent information sources, our predictor still outperforms an OOV-ratio baseline by a wide margin. In addition, we show that very little manually annotated training data is required for good prediction. Predicting standardness can help decide when to attempt to normalise the data to achieve better annotation results with standard tools, and provide linguists who are interested in non-standard language with a simple way of selecting only such texts for their research.
COBISS.SI-ID: 58338402
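The OOV-ratio baseline mentioned above can be sketched in a few lines: score a text by the fraction of its tokens that are out-of-vocabulary with respect to a standard-language word list. This is a generic reconstruction of the baseline idea, not the paper's code; the function name and the toy vocabulary are invented for the example.

```python
def oov_ratio(tokens, vocabulary):
    """Fraction of tokens absent from a standard-language vocabulary.

    tokens: list of token strings from the text being scored.
    vocabulary: set of known standard word forms (lowercased).
    A higher ratio suggests a less standard text.
    """
    if not tokens:
        return 0.0
    oov = sum(1 for t in tokens if t.lower() not in vocabulary)
    return oov / len(tokens)

# Toy vocabulary of "standard" forms; non-standard spellings count as OOV.
vocab = {"this", "is", "so", "great"}
print(oov_ratio(["this", "is", "soooo", "gr8"], vocab))  # 0.5
```

A trained regression model can beat such a baseline because the ratio alone ignores orthographic, lexical and contextual cues of non-standardness.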