The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions.
COBISS.SI-ID: 31087143
This paper discusses the building of a manually annotated training corpus of Slovene verbal multiword expressions, which was a part of PARSEME shared task that covered eighteen languages from various language families. In the course of the project, annotation guidelines were compiled, describing the notation scope in detail and proposing a multilingual system for verbal MWE categorisation. In this paper, we present the methods of identification, annotation scope and linguistic tests that determine structural, syntactic and lexical characteristics of the verbal MWE candidate lexical units. Furthermore, we highlight examples that specifically apply to the Slovene language. Tools and previously available data that were used in the project are also presented: an annotation tool and syntactically and morphosyntactically annotated training corpus for Slovene.
COBISS.SI-ID: 65967458
The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. About a quarter of the corpus is annotated with semantic role labels.
COBISS.SI-ID: 66454114