international scientific monograph; one of its chapters was contributed by the project group
COBISS.SI-ID: 291555584
special issue of a scientific journal; most papers were contributed by project group members
COBISS.SI-ID: 286873088
The paper presents the current version of Janes, the Slovene corpus of netspeak, which contains tweets, forum posts, news comments, blogs and blog comments, and user and talk pages from Wikipedia. First, we describe the harvesting procedure for each data source and provide a quantitative analysis of the corpus. Next, we present automatic and manual procedures for enriching the corpus with metadata, such as user type, gender and region, as well as text sentiment and standardness level. Finally, we give a detailed account of the linguistic annotation workflow, which includes tokenization, sentence segmentation, rediacritization, normalization, morphosyntactic tagging and lemmatization.
COBISS.SI-ID: 62245218
Web texts are becoming increasingly relevant sources of information, with web corpora useful for corpus linguistic studies and the development of language technologies. Even though web texts are directly accessible, which substantially simplifies the collection procedure, the compilation of web corpora is still complex, time-consuming and expensive. It is crucial that similar endeavours are not repeated, which is why it is necessary to make the created corpora easily and widely accessible both to researchers and to a wider audience. While this is logistically and technically a straightforward procedure, legal constraints, such as copyright, privacy and terms of use, severely hinder the dissemination of web corpora. This paper discusses legal conditions in this area, gives an overview of current practices and, using the Janes corpus of Slovene user-generated content as an example, proposes a range of mitigation measures to ensure free and open dissemination of Slovene web corpora.
COBISS.SI-ID: 62288994
key overview publication of the project, describing the creation and annotation of the corpus
COBISS.SI-ID: 64650338